Jinka University
Yebelay, M.
Chapter 1: Introduction
 Statistics: A field of study concerned with:
 collection, organization, analysis, summarization and
interpretation of numerical data, &
 the drawing of inferences about a body of data when only a small
part of the data is observed.
 The subject of statistics covers:
 the design of a study
 the collection of data
 the analysis of data
 the presentation of suitably summarized information, often in a
graphical or tabular form
 The interpretation of the analyses in a manner which
communicates the findings accurately
 Biostatistics: it is the application of statistical methods to the
fields of biological and medical sciences.
 Concerned with interpretation of biological data & the
communication of information derived from these data
 Has central role in medical investigations
 The numbers must be presented in such a way that valid
interpretations are possible
Why statistics?
 Do research and publish scientific literature
 Integral component of epidemiology
 Risk analysis and predictions
 Analysis of data from diagnostic services
 Analysis of data from pharmaceutical and agrochemical
industries
 Safety and quality of food for human consumption
Uses of Biostatistics
• Provide methods of organizing information
• Assessment of health status
• Health program evaluation
• Resource allocation
• Magnitude of association
– Strong vs weak association between exposure and
outcome
Example: Feeding vs Production
Health vs Production
Uses of biostatistics
• Assessing risk factors
– Cause & effect relationship (Eg, Environment/Housing vs Production)
• Evaluation of a new vaccine or drug
– What can be concluded if the proportion of animal free from
the disease is greater among the vaccinated than the
unvaccinated?
– How effective is the vaccine (drug)?
– Is the effect due to chance or some bias?
• Drawing of inferences
– Information from sample to population
What does biostatistics cover?
Research Planning
Design
Execution (Data collection)
Data Processing
Data Analysis
Presentation
Interpretation
Publication
Biostatistical thinking contributes to every step of a research project.
The best way to learn about biostatistics is to follow the flow of a
research project from inception to the final publication.
Analysis
• Analysis is the major part of learning about biostatistics
– There are dozens of different methods of analysis, which
makes it difficult to choose the correct method for a
particular case
– It is necessary to consider the philosophy that underlies all
methods of analysis:
• Use data from a sample to draw inference about a wider
population
Analysis
 The raw data are meaningless unless certain statistical treatment
is given to them.
 Analysis of data means to make the raw data meaningful or to
draw some information from the data
 Thus, the analysis of data serves the following main functions:
• To make the raw data meaningful
• To test null hypothesis
• To test the significance
• To draw some inferences or make generalization
• To estimate parameters (sample statistics and population
parameters)
Interpretation
• Interpretation of results of statistical analysis is not always
straightforward, but is simpler when the study has a clear
aim.
• If the study has been well designed and correctly analyzed
the interpretation of results can be fairly simple.
Types of Statistics
1. Descriptive statistics:
 Ways of organizing and summarizing data
 Helps to identify the general features and trends in a set of
data and extracting useful information
 Also very important in conveying the final results of a study
Example: tables, graphs, numerical summary measures
Types of Statistics
2. Inferential statistics:
• Methods used for drawing conclusions about a population
based on the information obtained from a sample of
observations drawn from that population
Example: Principles of probability, estimation, confidence
interval, comparison of two or more means or
proportions, hypothesis testing, etc.
Statistical variables and data
• A variable is a set of observations on a particular character that
can take values which vary from individual to individual or group
to group,
• e.g. height, weight, housing, blood count, enzyme activity,
coat colour, percentage of a flock which are pregnant, which
are diseased etc…
• Data are records of measurement, counts or observations of
variables.
• Examples of data are records of weights of calves in kg, milk
yield of cows in liter, male or female sex, and black or white
coat color of cattle.
Types of Data
1. Primary data: collected from the items or individual
respondents directly by the researcher for the purpose of a
study.
2. Secondary data: data that were collected and statistically treated
by other people or organizations, and whose information is then
used by someone else for a different purpose.
• Can be obtained from:
– Routinely kept records, literature
– Surveys
– Counting
– Experiments
– Reports
– Observation etc…
Types of variables
 Categorical (qualitative)
 Nominal scale
 Ordinal scale
• Nominal scale (classification or group): the distinct categories
which define the variables are unordered and each can be assigned
a name,
• It has categories that cannot be ranked.
e.g. coat colors (black, white)
 Sex (male or female)
 Breed (local, exotic)
 Numerical (quantitative)
 Discrete variable
 Continuous variable
Types of variable…
• Ordinal scale (ranked variables: small, medium, large): the
categories which constitute the variable have some intrinsic order
but there are no consistent and defined intervals between the
various categories;
• An ordinal variable has categories that can be ranked.
For example: Body condition scores, Degree of vigor of
motility of larvae.
 These “scales” are often given numerical values 1 to n. However,
the differences among those numbers do not have numerical
meaning.
Those scores depict categories, but not a numerical scale.
Types of variable…
Quantitative variables
• Consisting of numerical values (true numerical measurement)
on a well defined scale (measurement unit).
• Quantitative data relate to amounts, rather than just indicating
classes,
• These data may be further divided into: discrete and
continuous.
Types of variable…
• A discrete variable can take only one of a specified set of values,
such as whole numbers. Discrete data often arise from counts, i.e. they
are countable. For example:
• The number of ticks collected from animals
• Number of animals per households
• Number of parasite eggs per gram of feces
• Continuous variable theoretically may have any value within a
defined range and potentially can take any value between intervals
(though the range can be infinite). Examples are body weight,
height, milk yield, temperature, and antibody titre.
Relationships between Variables.
Variables
Categorical Quantitative
Nominal Ordinal
Discrete
(counting)
Continuous
(measuring)
Ordered
categories Ranks
Types of variable in a statistical model
•Dependent vs. Independent variables:
• Dependent (response variables or outcome variable) which
vary depending on the effects of Independent variables
Examples: Weight, milk yield, Disease status (i.e. its
presence or absence)
• Independent (explanatory or predictor) variables are those
variables that affect the dependent variables
Examples: can be sex, age, environment, breed,
management, genotype etc
Data coding
 Data Coding is an analytical process in which data are categorized by
numerical value to facilitate analysis.
 Coding means the transformation of data into a form understandable by
computer (statistical) software.
 Both qualitative and quantitative data can be coded to make
computing with statistical software easier.
 Questionnaire data can be pre-coded (process of assigning codes to
expected answers on designed questionnaire),
 field-coded (process of assigning codes as soon as data is available,
usually during fieldwork),
 post-coded (coding of open questions on completed questionnaires) or
office-coded (done after fieldwork).
Examples of data coding
Qualitative variables
variable          category      code
sex               male          1
                  female        0
phy status        pregnant      1
                  lactate       2
                  dry           3
body cond         good          1
                  poor          0
Mastitis          positive      1
                  negative      0
Quantitative variables
variable          class         code
Cattle herd size  < 40          0
                  > 40          1
animal age        < 4 y         0
                  > 4 y         1
Milk yield        1 lit         1
                  2 – 3 lit     2
                  above 3 lit   3
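The coding of qualitative answers can be sketched in a few lines of Python. This is only an illustration: the dictionary `CODE_BOOK`, the function `code_record` and the example cow are hypothetical names and data, with the codes copied from the examples above.

```python
# Hypothetical code book: category labels mapped to the numeric codes
# used in the examples above.
CODE_BOOK = {
    "sex": {"male": 1, "female": 0},
    "phy_status": {"pregnant": 1, "lactate": 2, "dry": 3},
    "body_cond": {"good": 1, "poor": 0},
    "mastitis": {"positive": 1, "negative": 0},
}


def code_record(record):
    """Replace each recorded category with its numeric code."""
    return {var: CODE_BOOK[var][value] for var, value in record.items()}


# One hypothetical questionnaire record for a single cow.
cow = {
    "sex": "female",
    "phy_status": "lactate",
    "body_cond": "good",
    "mastitis": "negative",
}
coded = code_record(cow)
print(coded)
```

An unknown category raises a `KeyError`, which is a simple way to catch recording mistakes before analysis.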
Chapter 2:
Strategies For Understanding The Meanings Of Data
 Data is collected with the intention of gathering (assembling)
information
 Information can be easily obtained from raw data when the
data set comprises relatively few observations made on a small
groups of animals
 As the number of observations becomes high, it is difficult to
obtain an overall ‘picture’ of the data
 The primary stage in the process of obtaining this picture is to
organize the data to establish how often different values occur
(frequency distributions).
Data description…
 The next step is to further condense the data, reducing to a
manageable size and obtain a snapshot view as an aid for
understanding and interpretation
 There are various methods we adopt
Tables to exhibit features of the data
Diagrams to illustrate patterns
Numerical measures to summarize the data
Data description…
 Graphical presentations of qualitative variables can include bar,
column or pie-charts.
 When describing qualitative data each observation is assigned
to a specific category. Data are then described by the number of
observations in each category or by the proportion of the total
number of observations.
 The most widely used graph for presentation of quantitative data
is a histogram.
 In order to present a distribution, the quantitative data are
partitioned into classes and the histogram shows the number or
relative frequency of observations for each class.
Frequency distribution
 A frequency distribution shows the frequencies of occurrence of
the observation in a data set.
 When making frequency distributions, it is vital to distinguish
between categorical and quantitative variables.
When a variable is categorical, the frequency is the number of observations
in each class or category of the variable.
 When the variable is quantitative, class can be created between
non-overlapping, preferably equal intervals
Frequency distribution
The number of observations belonging to each class is the class
frequency i.e. frequency distribution
The frequency distribution is presented in the form of a table or
a bar chart (discrete variable) or a histogram (continuous
variables)
Relative frequency refers to the proportion or percentage of
observations in each class or category
The sum of the relative frequencies of all the categories is unity
(or 100%) apart from rounding errors
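As a minimal sketch, a frequency distribution and the matching relative frequencies for a categorical variable can be tabulated as follows. The coat colour observations are hypothetical and serve only to show the calculation.

```python
from collections import Counter

# Hypothetical coat colours recorded for eight cattle.
coat_colours = ["black", "white", "black", "brown",
                "black", "white", "brown", "black"]

freq = Counter(coat_colours)           # class frequencies
n = sum(freq.values())                 # total number of observations
rel_freq = {colour: count / n for colour, count in freq.items()}

for colour, count in freq.items():
    print(f"{colour:6s} {count:3d} {100 * rel_freq[colour]:5.1f}%")
```

The relative frequencies add up to 1 (100%), as stated above.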
Tables
A table is an orderly arrangement of observation usually
numbers in rows and columns,
The layout of the table will be dictated by the data, and
therefore will vary for different types of data
Table 1. Percentage of the households’ sources of income
Diagrams
A diagram is a graphic representation of data and may take
several forms (Chart, graph and Schematic)
It is often easier to distinguish important patterns from a
diagram rather than a table,
Are more useful to convey information quickly
Categorical data
Bar chart
 Is a diagram in which every category of the variable is
represented;
 The length of each bar, which should be of constant width,
depicts the number or percentage of individuals belonging
to that category.
 The length of the bar is proportional to the frequency in the
relevant category, so it is essential that the scale showing the
frequency should start at zero for each bar
Figure xx: Prevalence of dairy cattle diseases (a) and ticks (b) in
cattle over time in a pastoral region
Panels: a. Prevalence of Dairy Cattle Diseases; b. Tick prevalence
Pie chart
 Is a circle divided into segments with each segment portraying a different
category of the qualitative variable.
 The total area of the circle represents the total frequency or
percentage, and the area of a given sector is proportional to the
percentage of individuals falling into that category.
 A pie chart should include a statement of the percentage or actual
number of individuals in each segment
 Generally, bar chart is preferable to the pie chart as the former is easier
to construct and is more useful for comparative purposes, partly because
it is easier to compare lengths by eye rather than angles
Categorical data
Figure xx: Causes of camel calf mortalities in Borana area
(hypothetical data).
Pie chart segments (cause, count, percentage):
septicemia, 44, 36%
pneumonia, 27, 22%
diarrhea, 13, 11%
skin necrosis, 5, 4%
sunken eyes, 14, 11%
pox, 6, 5%
others, 13, 11%
Quantitative data
 When the data are quantitative, we may use Dot plot, Histogram,
Scatter plot, line graph, Box plot, Stem and Leaf
Dot diagram
If the data set is of a manageable size, the best way of displaying it is
to show every value in a dot diagram/plot
Fig xx. Dot diagram of mean daily tick count of different
species around Hawassa
Histogram
Histogram is a two-dimensional diagram in which usually the
horizontal axis represents the units of the measurement of the
variable of interest, with each class interval being clearly
delineated
To construct a histogram, the data range is divided into 5 to 20
classes (bins) of equal width
Range = maximum – minimum value
If the intervals are of equal width, then the height of the bin
(rectangle) is proportional to the frequency
Histogram gives a good picture of the frequency distribution of
quantitative variables
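The class construction behind a histogram can be sketched as below. The function name `class_frequencies` and the ten weights are hypothetical; the sketch assumes equal-width classes and at least two distinct values.

```python
def class_frequencies(values, k):
    """Split the data range into k equal-width classes and count each class."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k              # class width = range / number of classes
    counts = [0] * k
    for v in values:
        # The maximum value is placed in the last class.
        idx = min(int((v - lo) / width), k - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(k + 1)]
    return edges, counts


# Hypothetical body weights (kg) of ten cows.
weights = [250, 300, 310, 340, 355, 360, 380, 400, 420, 500]
edges, counts = class_frequencies(weights, 5)
for left, right, c in zip(edges, edges[1:], counts):
    print(f"{left:5.0f} to {right:5.0f}: {c}")
```

Each count is the height of one bar when the classes have equal width.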
Histogram…
The distribution is symmetrical if its shape to the right of a
central value is a mirror image of that to the left of the central
value
It is used to evaluate normal distribution
The tails of the frequency distribution represent the
frequencies at the extremes of the distribution
The frequency distribution is skewed to the right (positively
skewed) if the right-hand tail is extended
The frequency distribution is skewed to the left (negatively
skewed) if the left-hand tail is extended
It is common to find biological data which are skewed to the
right
Figure xx: Histogram of weights of 344 dairy cows (horizontal axis:
weight in kg, 250 to 500; vertical axis: frequency, 0 to 60)
Figure xx: Line graph showing mean monthly minimum, maximum
and average temperature for Borana areas (1976 -2011)
Line Graph
 Line graphs compare variables, each of which is plotted along
x-and-y coordinate.
 Show specific values of data and trends in data, and enable the viewer
to make predictions.
Box- plot
The scale of measurement of the variable is usually drawn
vertically
The diagram comprises a box with horizontal limits defining
the upper and the lower quartiles and representing the
interquartile range,
the central 50% of the observations, with the median marked
by a horizontal line within the box
The whiskers may extend from the 2.5th to the 97.5th percentile,
or from the minimum to the maximum value of the
set of observations
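The numbers a box plot displays can be computed as in the sketch below. It assumes Python 3.8 or later for `statistics.quantiles`; the function name `box_plot_summary` and the data are hypothetical. Statistical packages use slightly different quartile conventions, so results may differ a little from other software.

```python
import statistics


def box_plot_summary(values):
    """Return the minimum, quartiles, maximum and interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # three cut points
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "iqr": q3 - q1,    # the box covers the central 50% of observations
    }


summary = box_plot_summary([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(summary)
```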
Fig xx. Box plot showing livestock wealth by species diversity
(horizontal axis: number of livestock species, 1 to 6; vertical axis scale: 0 to 100)
Scatter diagram
The scatter diagram is an effective way of presenting data when
we are interested in trends and relationship between two
variables.
The diagram is a two-dimensional plot in which each axis
represents the scale of measurement of one of the two variables.
Using this rectangular co-ordinate system, we relate the value
for an individual on the horizontal scale to the corresponding
value for that individual on the vertical scale by marking with
an appropriate symbol
The points can be joined to produce a line graph, or draw a line
which best represents the relationship
Fig xx. Relationship of cattle population with rainfall in Borana
(between 1976 and 2011)
Stem and Leaf
Each value is divided into two parts, ‘Stem’ and ‘Leaf’. ‘Stem’
corresponds to higher decimal places, and ‘Leaf’ corresponds to
lower decimal places.
‘Stems’ are sorted in ascending order in the first column.
The appropriate ‘Leaf’ for each observation is recorded in the
row with the appropriate ‘Stem’
Fig xx. A ‘Stem and Leaf’ plot of
the weights of calves
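A stem and leaf display can be built with a short routine like the one below. It is a sketch for whole-number data, taking the tens as the stem and the units as the leaf; the function name `stem_and_leaf` and the calf weights are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group whole numbers by stem (tens) and list their leaves (units)."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows[stem].append(leaf)
    return dict(rows)


# Hypothetical calf weights in kg.
calf_weights = [32, 35, 41, 44, 47, 48, 53, 56, 56, 61]
plot = stem_and_leaf(calf_weights)
for stem in sorted(plot):
    print(f"{stem} | {' '.join(str(leaf) for leaf in plot[stem])}")
```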
Numerical measures of description
If we are able to determine some form of average that measures
the central tendency of the data set, and if we know how widely
scattered the observations are in either direction from that
average, then we will have a reasonable ‘picture’ of the data.
 These two characteristics of a set of observations measured on a
numerical variable are known as
Measures of location (averages, Central Tendency)
 give useful information about the center of the data
 Measures of dispersion (spread)
 how “spread out” the numbers are about the center.
Measures of location
 The tendency of statistical data to get concentrated at certain
values is called the “Central Tendency” and
 The various methods of determining the actual value at which the
data tend to concentrate are called measures of central
Tendency or averages.
 Hence, an average is a value which tends to sum up or describe the
mass of the data.
 Measures of central tendency are numbers that tell us where the
majority of values in the distribution are located
 Common measures of central tendency are Mean, Median and
Mode.
Measures of location
1. Arithmetic mean
 is the most commonly used measure of location.
 It is obtained by adding together the observations in a data set
and dividing by the number of observations in the set
 The mean has the disadvantage that its value is influenced by
outliers
 An outlier is an observation whose value is highly inconsistent
with the main body of the data.
 An outlier with an excessively large value will tend to increase
the mean unduly, whilst a particularly small value will decrease it.
 Especially it is appropriate to measure location of data if the
observations were sampled from symmetrical distributions.
 The mean can be misleading if there are any extreme values in a group
of numbers.
 For example, the mean of the group 1, 2, 3, 2, 4, 5,19 is 5.1. The
value 19 is an extreme value, since it is far higher than any of the
other numbers in the group. Since only one of the values in the
group is actually 5.1 or greater, the mean is not representative of
the group.
 In this case, the median may provide a better representation.
 The mean will be ‘pulled’ to the right (increased in value) if the
distribution is skewed to the right, and ‘pulled’ to the left (decreased in
value) if the distribution is skewed to the left.
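 The arithmetic mean of a sample of n numbers y1, y2, ..., yn is:
$\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$
 The arithmetic mean for grouped data is:
$\bar{y} = \frac{\sum_{i} f_i y_i}{\sum_{i} f_i}$, where $f_i$ is the frequency of class i and $y_i$ is its midpoint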
 The arithmetic mean and the median are close or equal in value if
the distribution is symmetrical.
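The two versions of the arithmetic mean can be sketched as follows. The simple mean uses the group 1, 2, 3, 2, 4, 5, 19 from above; the grouped mean weights each class midpoint by its class frequency. The function name `grouped_mean`, the midpoints and the frequencies are hypothetical.

```python
def grouped_mean(midpoints, frequencies):
    """Mean for grouped data: sum of (frequency x midpoint) / sum of frequencies."""
    total = sum(frequencies)
    return sum(f * y for f, y in zip(frequencies, midpoints)) / total


# Simple mean of the example group used above.
values = [1, 2, 3, 2, 4, 5, 19]
simple_mean = sum(values) / len(values)

# Hypothetical weight classes: midpoints in kg and their frequencies.
midpoints = [275, 325, 375, 425]
frequencies = [2, 5, 2, 1]
mean_weight = grouped_mean(midpoints, frequencies)

print(round(simple_mean, 1), mean_weight)
```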
Geometric mean: It is obtained by taking the nth root of the
product of n values, i.e., if the values of the observations are
denoted by x1, x2, …, xn, then $GM = \sqrt[n]{x_1 x_2 \cdots x_n}$.
It is preferable to the arithmetic mean if the series of observations
contains one or more unusually large values.
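A hedged sketch of the geometric mean is given below. It works on the log scale, which gives the same result as the nth root of the product and avoids very large intermediate products; all values must be positive. The function name `geometric_mean` and the three values are hypothetical.

```python
import math


def geometric_mean(values):
    """Geometric mean via the mean of the logs (all values must be > 0)."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical series containing one unusually large value.
series = [10, 100, 1000]
gm = geometric_mean(series)
am = sum(series) / len(series)
print(gm, am)   # the geometric mean is pulled up far less than the arithmetic mean
```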
2. Median
 is the middle value of the observations when they are arranged in
order of magnitude.
 It is appropriate for skewed data.
 To calculate the median: we have to arrange all of the recorded
values in order of size and then find the middle value.
 If we arrange the above numbers in numerical order, we
obtain: 1, 2, 2, 3, 4, 5, 19. The median is 3.
 In the above example, the median is much more representative of
the group than the mean (5.1). Extreme values do not affect the
median, and the median value is usually typical of the data.
2. Median
 If there is an even number of values, use the mean of the two
middle values:
 For example, for 19, 24, 26, 30, 31, 34, The median is (26 +
30)/2 = 28.
 The arithmetic mean and the median are close or equal in
value if the distribution is symmetrical.
 The advantage of the median is that it is not affected by
outliers or if the distribution of the data is skewed. Thus
the median will be less than the mean if the data are
skewed to the right, and greater than the mean if the
data are skewed to the left.
3. Mode
 is the most frequently occurring observation and the measure
does not involve the whole observation.
 It is not affected by extreme values and most commonly used in
skewed data.
 This can be determined by creating frequency table.
 The mode is determined by disregarding most of the
observations
 Some distributions do not have a mode, whilst other
distributions may have more than one mode. /Unimodal or
Bimodal/
 If we arrange the previous numbers in numerical
order, we obtain: 1, 2, 2, 3, 4, 5, 19. The mode is 2.
 Although the mean is the measure that is most common, when
distributions are asymmetric, the median and mode can give better
information about the set of data.
 Unusually extreme values in a sample will affect the arithmetic
mean more than the median. In that case the median is a more
representative measure of central tendency than the arithmetic
mean.
 For extremely asymmetric distributions the mode is the best
measure.
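The three measures of location for the example group can be checked with Python's standard `statistics` module, as a minimal sketch (assuming Python 3.8 or later for `multimode`):

```python
import statistics

values = [1, 2, 3, 2, 4, 5, 19]

mean = statistics.mean(values)        # pulled upwards by the extreme value 19
median = statistics.median(values)    # middle value of the ordered data
modes = statistics.multimode(values)  # list, because there may be several modes

# With an even number of values the median is the mean of the two middle values.
even_median = statistics.median([19, 24, 26, 30, 31, 34])

print(round(mean, 1), median, modes, even_median)
```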
 Skewness: If extremely low or extremely high observations
are present in a distribution, then the mean tends to shift
towards those scores. Based on the type of skewness,
distributions can be:
a) Negatively skewed distribution: occurs when majority of
scores are at the right end of the curve and a few small scores
are scattered at the left end. (if it has a long tail to the left)
b) Positively skewed distribution: Occurs when the majority
of scores are at the left end of the curve and a few extreme
large scores are scattered at the right end. (if it has a long
tail to the right)
 Consider the three distributions shown in Figure
 For example, observation of the “No Skew” distribution would
yield the following data: 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9,
9, 10, 10, 10, 11, 11. Using SPSS software, the following descriptive
statistics were obtained for these three distributions
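A simple moment-based skewness coefficient can be sketched as below. This is an illustration, not necessarily the formula a statistical package reports, since packages often apply a small-sample correction. The function name `moment_skewness` and the right-skewed data are hypothetical; the "No Skew" data are those listed above.

```python
def moment_skewness(values):
    """Third central moment divided by the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


no_skew = [5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 8,
           9, 9, 9, 9, 10, 10, 10, 11, 11]
right_skew = [1, 1, 1, 2, 2, 3, 10]   # hypothetical, long tail to the right

print(moment_skewness(no_skew), moment_skewness(right_skew))
```

A value near zero indicates symmetry, a positive value a long right tail and a negative value a long left tail.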
Exercise:
The following data, 44.4, 67.6, 76.2, 64.7, 80.0, 64.2, 75.0, 34.2,
29.2, represent the infection rates (%) of goats with the viral condition peste
des petits ruminants. Calculate the median.
 Calculate the mean and the median of the following data set.
What evidence is there for concluding that the data are or are not
symmetrically distributed?
Arranged in ascending order, the rates (%) are: 29.2, 34.2, 44.4,
64.2, 64.7, 67.6, 75.0, 76.2 and 80.0. There are nine observations, so
the median is the (9 + 1)/2 = 5th observation in the ordered set, i.e.
the median is 64.7%.
Mean = 761.2/16 = 47.58 g, median = 51.95 g. The mean and the
median do not coincide, indicating that the data are skewed. The
mean is less than the median, indicating that the data are skewed
to the left.
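The comparison of mean and median used in these answers can be sketched as a rough check on the goat infection rates. The function name `skew_direction` is hypothetical, and the rule is only a quick indication of skewness, not a formal test.

```python
import statistics


def skew_direction(values):
    """Rough indication of skewness from the position of mean and median."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean < median:
        return "left"
    if mean > median:
        return "right"
    return "none"


ppr_rates = [44.4, 67.6, 76.2, 64.7, 80.0, 64.2, 75.0, 34.2, 29.2]
print(statistics.median(ppr_rates),
      round(statistics.mean(ppr_rates), 1),
      skew_direction(ppr_rates))
```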
Which Measure Should You Use?
 The choice of a particular measure of central tendency depends
on the shape of the population distribution. When we are dealing
with sample-based data, the distribution of the data from the
sample may suggest the shape of the population distribution.
 For normally distributed data, mathematical theory of the
normal distribution suggests that the arithmetic mean is the most
appropriate measure of central tendency.
If a log transformation creates normally distributed data, then the
geometric mean is appropriate to the raw data.
Which Measure Should You Use?
 For symmetric distributions, the mean and median are equal. If
the distribution is symmetric and has only one mode, all three
measures are the same.
 For skewed distributions, with a single mode, the three measures
differ.
 For positively skewed distributions (where the upper, or right, tail
of the distribution is longer than the lower, or left, tail)
the measures are ordered as follows: mode < median < mean.
 For negatively skewed distributions (where the lower tail of the
distribution is longer than the upper tail), the reverse ordering
occurs: mean < median < mode.
Figure xxx Symmetric (B) and skewed distributions:
right skewed (A) and left skewed (C)
Measures of Dispersion
Consider the following data sets:
        Observations                  Mean
Set 1:  60  40  30  50  60  40  70    50
Set 2:  50  49  49  51  48  50  53    50
The two data sets given above have a mean of 50, but obviously set 1
is more “spread out” than set 2. How do we express this numerically?
The object of measuring this scatter or dispersion is to obtain a
single summary figure which adequately exhibits whether the
distribution is compact or spread out.
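One way to express the difference numerically is sketched below, using the range and the standard deviation. The sketch reads the last number in each row of the table as the mean and the first seven numbers as the observations, which is an interpretation of the table layout.

```python
import statistics

set_1 = [60, 40, 30, 50, 60, 40, 70]
set_2 = [50, 49, 49, 51, 48, 50, 53]

for name, data in (("Set 1", set_1), ("Set 2", set_2)):
    mean = statistics.mean(data)
    data_range = max(data) - min(data)
    sd = statistics.stdev(data)        # sample standard deviation (n - 1)
    print(f"{name}: mean={mean}, range={data_range}, sd={sd:.2f}")
```

Both sets have the same mean, but the range and standard deviation of Set 1 are much larger.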
Figure shows the frequency polygons for two populations that have
equal means but different amounts of variability. Population B,
which is more variable than population A, is more spread out. If the
values are widely scattered, the dispersion is greater.
Figure xx: Two frequency distributions with equal mean but
different amount of dispersion
 Common measures of variability are the range, variance, standard
deviation and coefficient of variation.
1. Range is the difference between the maximum and minimum
values in a set of observations.
 It wastes information because it uses only the two extreme values and ignores the rest of the data.
 It gives undue weight to extreme values and will, therefore,
overestimate the dispersion of most of the observations if outliers
are present
2. Variance is the expected squared deviation of a random
variable from its mean
Measures of Dispersion
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
 The variance is determined by calculating the deviation of each
observation from the mean.
 This deviation will be large if the observation is far from the
mean, and it will be small if the observation is close to the mean.
3. Standard deviation (S) is a measure of the scatter of the
observations in relation to their mean i.e. how close are the
observation to their mean.
 to obtain a measure of dispersion in original units
 It is the square root of the variance: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
4. Coefficient of variation (CV) is the standard deviation expressed
as a percentage of the mean.
 It can be used for comparing relative amounts of variation. This is
especially true when variability is compared among sets of data
that have different units or even the same unit of measurement
 The standard error of the mean (SEM) is a measure of the
precision of the sample mean as an estimate of the population
mean. It evaluates the sampling error by giving an indication of
how close a sample mean is to the population mean it is estimating
(inferring). / It is indication of reliability of mean/
$CV = \frac{s}{\bar{x}} \times 100$
Confidence interval (Confidence limit)
 CL is the range of values within which the true population mean is
expected to lie with a certain probability (i.e. 95%).
 It has the lower and the upper limits of the confidence interval
 If the confidence interval is wide, then the sample mean is a poor
estimate of the population mean.
 If the confidence interval is narrow, then the sample mean is a
precise estimate of the population mean.
 The 95% confidence interval for the mean is calculated as
Mean ± 1.96 × SEM
Exercise
Calculate the standard deviation, variance and standard error
of the following data
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
Column 1   Column 2        Column 3     Column 4
xi         xi − x̄          (xi − x̄)²    xi²
0          0 − 5 = −5      25           0
1          1 − 5 = −4      16           1
2          2 − 5 = −3      9            4
3          3 − 5 = −2      4            9
4          4 − 5 = −1      1            16
5          5 − 5 = 0       0            25
6          6 − 5 = 1       1            36
7          7 − 5 = 2       4            49
8          8 − 5 = 3       9            64
9          9 − 5 = 4       16           81
10         10 − 5 = 5      25           100
Sum: 55    0               110          385
 Calculate the mean (see the first column, xi): 55/11 = 5
 Subtract the mean from each observation to find the deviations
from the mean (see the 2nd column, xi − x̄).
 Square the deviations from the mean (see the 3rd column, (xi −
x̄)², above).
 Sum the squared deviations (see the 3rd column): 110
 Divide the sum of the squared deviations by n − 1 to find the
variance: 110/10 = 11
 Take the square root of the variance to calculate the standard
deviation: s = √11 = 3.32
 Divide the standard deviation by √n to find the standard error: SE = 3.32/√11 = 1.00
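The steps of this exercise can be followed in a short script, as a minimal sketch that mirrors the hand calculation (variable names are illustrative):

```python
import math

data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
n = len(data)

mean = sum(data) / n                              # 55 / 11
deviations = [x - mean for x in data]             # column 2
sum_sq = sum(d ** 2 for d in deviations)          # sum of column 3
variance = sum_sq / (n - 1)                       # divide by n - 1
sd = math.sqrt(variance)                          # standard deviation
se = sd / math.sqrt(n)                            # standard error of the mean

print(mean, sum_sq, variance, round(sd, 2), round(se, 2))
```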
Exercise
The following are progesterone in the milk (ng/ml) of 14 cows, 4.37,
4.87, 4.35, 3.92, 4.68, 4.54, 5.24, 4.57, 4.59, 4.66, 4.40, 4.73, 4.83, 4.21.
Given the variance of 0.10177, Calculate the
A) Arithmetic mean
B) Median
C) Standard deviation
D) Standard error
E) 95% Confidence interval
F) Coefficient of variation (CV)
 Mean = 63.96/14 = 4.57
 Median = 4.58
 Variance =(SD)2 = (0.319)2 = 0.10177
 SD = √ Variance = √ 0.10177 = 0.319
 SE = 0.319/√14 =0.0853
 CV= (0.319/4.57)*100= 6.98%
 95% CL = 4.57 ± 1.96 × 0.0853 = [4.40, 4.74]
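These answers can be reproduced with the sketch below, which uses the sample variance (divisor n − 1) and follows the slides in using the multiplier 1.96. That multiplier is a normal approximation; for a sample of 14 a t-based multiplier would be somewhat larger.

```python
import math
import statistics

progesterone = [4.37, 4.87, 4.35, 3.92, 4.68, 4.54, 5.24,
                4.57, 4.59, 4.66, 4.40, 4.73, 4.83, 4.21]
n = len(progesterone)

mean = statistics.mean(progesterone)
median = statistics.median(progesterone)
variance = statistics.variance(progesterone)   # sample variance
sd = math.sqrt(variance)
se = sd / math.sqrt(n)
cv = sd / mean * 100
ci_lower, ci_upper = mean - 1.96 * se, mean + 1.96 * se

print(f"mean={mean:.2f} median={median:.2f} sd={sd:.3f} se={se:.4f}")
print(f"cv={cv:.2f}% 95% CI=[{ci_lower:.2f}, {ci_upper:.2f}]")
```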
 Suppose two samples yield the following results:
 Which is more variable, the weights of the 25-year-olds or the
weights of the 11-year-olds?
 A comparison of the standard deviations might lead one to
conclude that the two samples possess equal variability.
 If we compute the coefficients of variation, however, we have
for the 25-year-olds
and for the 11-year-olds
 If we compare these results, we get quite a different impression.
It is clear from this example that variation is much higher in the
sample of 11-year-olds than in the sample of 25-year-olds.
 Kurtosis is a measure of the degree to which a distribution is
“peaked” or flat in comparison to a normal distribution whose
graph is characterized by a bell-shaped appearance.
 A distribution, in comparison to a normal distribution, may
possess an excessive proportion of observations in its tails, so
that its graph exhibits a flattened appearance. Such a
distribution is said to be platykurtic.
 Conversely, a distribution, in comparison to a normal
distribution, may possess a smaller proportion of observations
in its tails, so that its graph exhibits a more peaked appearance.
Such a distribution is said to be leptokurtic.
 A normal, or bell-shaped distribution, is said to be mesokurtic.
 Consider the three distributions shown in Figure
 For example, observation of the “mesokurtic” distribution would
yield the following data: 1, 2, 2, 3, 3, 3, 3, 3, … , 9, 9, 9, 9, 9, 10, 10,
11. Using SPSS software, the following descriptive statistics were
obtained for these three distributions:

Lect 1_Biostat.pdf

  • 2. Chapter 1: Introduction  Statistics: A field of study concerned with:  collection, organization, analysis, summarization and interpretation of numerical data, &  the drawing of inferences about a body of data when only a small part of the data is observed.  The subject of statistics covers:  the design of a study  the collection of data  the analysis of data  the presentation of suitably summarized information, often in a graphical or tabular form  The interpretation of the analyses in a manner which communicates the findings accurately
  • 3.  Biostatistics: it is the application of statistical methods to the fields of biological and medical sciences.  Concerned with interpretation of biological data & the communication of information derived from these data  Has central role in medical investigations  The numbers must be presented in such a way that valid interpretations are possible
  • 4. Why statistics?  Do research and publish scientific literature  Integral component of epidemiology  Risk analysis and predictions  Analysis of data from diagnostic services  Analysis of data from pharmaceutical and agrochemical industries  Safety and quality of food for human consumption
  • 5. Uses of Biostatistics • Provide methods of organizing information • Assessment of health status • Health program evaluation • Resource allocation • Magnitude of association – Strong vs weak association between exposure and outcome Example: Feeding vs Production Health vs Production
  • 6. Uses of biostatistics • Assessing risk factors – Cause & effect relationship (Eg, Environment/Housing vs Production) • Evaluation of a new vaccine or drug – What can be concluded if the proportion of animal free from the disease is greater among the vaccinated than the unvaccinated? – How effective is the vaccine (drug)? – Is the effect due to chance or some bias? • Drawing of inferences – Information from sample to population
  • 7. What does biostatistics cover? Research Planning Design Execution (Data collection) Data Processing Data Analysis Presentation Interpretation Publication Biostatistical thinking contribute in every step in a research The best way to learn about biostatistics is to follow the flow of a research from inception to the final publication
  • 8. Analysis • Analysis part is the major part of learning about biostatistics – There are dozens of different methods of analysis, which makes difficult the choice of the correct method for a particular case – It is necessary to consider the philosophy that underlies all methods of analysis: • Use data from a sample to draw inference about a wider population
  • 9. Analysis  The raw data are meaningless unless certain statistical treatment is given to them.  Analysis of data means to make the raw data meaningful or to draw some information from the data  Thus, the analysis of data serves the following main functions: • To make the raw data meaningful • To test null hypothesis • To test the significance • To draw some inferences or make generalization • To estimate parameters (sample statiscts and population parameters)
  • 10. Interpretation • Interpretation of results of statistical analysis is not always straightforward, but is simpler when the study has a clear aim. • If the study has been well designed and correctly analyzed the interpretation of results can be fairly simple.
  • 11. Types of Statistics 1. Descriptive statistics:  Ways of organizing and summarizing data  Helps to identify the general features and trends in a set of data and extracting useful information  Also very important in conveying the final results of a study Example: tables, graphs, numerical summary measures
  • 12. Types of Statistics 2. Inferential statistics: • Methods used for drawing conclusions about a population based on the information obtained from a sample of observations drawn from that population Example: Principles of probability, estimation, confidence interval, comparison of two or more means or proportions, hypothesis testing, etc.
  • 13. Statistical variables and data • A variable is a set of observations on a particular character that can take values which vary from individual to individual or group to group, • e.g. height, weight, housing, blood count, enzyme activity, coat colour, percentage of a flock which are pregnant, which are diseased etc… • Data are records of measurement, counts or observations of variables. • Examples of data are records of weights of calves in kg, milk yield of cows in liter, male or female sex, and black or white coat color of cattle.
  • 14. Types of Data 1. Primary data: collected from the items or individual respondents directly by the researcher for the purpose of a study. 2. Secondary data: which had been collected by certain people or organization, & statistically treated and the information contained in it is used for other purpose by other people. • Can be obtained from: – Routinely kept records, literature – Surveys – Counting – Experiments – Reports – Observation etc…
  • 15. Types of variables  Categorical (qualitative)  Nominal scale  Ordinal scale • Nominal scale (classification or group): the distinct categories which define the variables are unordered and each can be assigned a name, • It has has categories that cannot be ranked. e.g. coat colors (black, white)  Sex (male or female)  Breed (local, exotic)  Numerical (quantitative)  Discrete variable  Continous variable
  • 16. Types of variable… • Ordinal scale (ranked variables: small, medium, large): the categories which constitute the variable have some intrinsic order but there are no consistent and defined intervals between the various categories; • An ordinal variable has categories that can be ranked. For example: Body condition scores, Degree of vigor of motility of larvae.  These “scales” are often given numerical values 1 to n. However, the differences among those numbers do not have numerical meaning. Those scores depict categories, but not a numerical scale.
  • 17. Types of variable… Quantitative variables • Consisting of numerical values (true numerical measurement) on a well defined scale (measurement unit). • Quantitative data relate to amounts, rather than just indicating classes, • These data may be further divided into: discrete and continuous.
  • 18. Types of variable… • Discrete variable can have only one of a specified set of values, such as whole numbers. Discrete data often generate counts, i.e. it is countable. for example:- • The number of ticks collected from animals • Number of animals per households • Number of parasite eggs per gram of feces • Continuous variable theoretically may have any value within a defined range and potentially can take any value between intervals (though the range can be infinite). Examples are body weight, height, milk yield, temperature, and antibody titre.
  • 19. Relationships between Variables. Variables Categorical Quantitative Nominal Ordinal Discrete (counting) Continuous (measuring) Ordered categories Ranks
  • 20. Types of variable in a statistical model •Dependent vs. Independent variables: • Dependent (response variables or outcome variable) which vary depending on the effects of Independent variables Examples: Weight, milk yield, Disease status (i.e. its presence or absence) • Independent (explanatory or predictor ) variables are those variables that affects the dependent variables Examples: can be sex, age, environment, breed, management, genotype etc
  • 21. Data coding  Data Coding is an analytical process in which data are categorized by numerical value to facilitate analysis.  Coding means the transformation of data into a form understandable by computer (statistical) software.  Both qualitative and quantitative data can be coded to make data computing with statistical software ease.  Questionnaire data can be pre-coded (process of assigning codes to expected answers on designed questionnaire),  field-coded (process of assigning codes as soon as data is available, usually during fieldwork),  post-coded (coding of open questions on completed questionnaires) or office-coded (done after fieldwork).
  • 22. Examples of data coding variables codes sex male 1 female 0 phy status pregnant 1 lactate 2 dry 3 body cond good 1 poor 0 Mastitis positive 1 negative 0 variables codes Cattle herd size <40 0 > 40 1 animal age < 4 y 0 > 4y 1 Milk yield 1 lit 1 2 – 3 lit 2 above 3 lit 3 Qualitative Quantitative
  • 23. Chapter 2: Strategies For Understanding The Meanings Of Data  Data is collected with the intention of gathering (assembling) information  Information can be easily obtained from raw data when the data set comprises relatively few observations made on a small groups of animals  As the number of observations becomes high, it is difficult to obtain an overall ‘picture’ of the data  The primary stage in the process of obtaining this picture is to organize the data to establish how often different values occur (frequency distributions).
  • 24. Data description…  The next step is to further condense the data, reducing to a manageable size and obtain a snapshot view as an aid for understanding and interpretation  There are various methods we adopt Tables to exhibit features of the data Diagrams to illustrate patterns Numerical measures to summarize the data
  • 25. Data description…  Graphical presentations of qualitative variables can include bar, column or pie-charts.  When describing qualitative data each observation is assigned to a specific category. Data are then described by the number of observations in each category or by the proportion of the total number of observations.  The most widely used graph for presentation of quantitative data is a histogram.  In order to present a distribution, the quantitative data are partitioned into classes and the histogram shows the number or relative frequency of observations for each class.
  • 26. Frequency distribution  A frequency distribution shows the frequencies of occurrence of the observation in a data set.  When making frequency distributions, it is vital to distinction between categorical and quantitative variables. When a variable is categorical, frequency observations occurs in every class or category of the variable.  When the variable is quantitative, class can be created between non-overlapping, preferably equal intervals
  • 27. Frequency distribution The number of observations belonging to each class is the class frequency i.e. frequency distribution The frequency distribution is presented in the form of a table or a bar chart (discrete variable) or a histogram (continuous variables) Relative frequency refers to the proportion or percentage observation in each class or category The sum of the relative frequencies of all the categories is unity (or 100%) apart from rounding errors
  • 28. Tables A table is an orderly arrangement of observation usually numbers in rows and columns, The layout of the table will be dictated by the data, and therefore will vary for different types of data Table 1. Percentage of the households’ sources of income
  • 29. Diagrams A diagram is a graphic representation of data and may take several forms (Chart, graph and Schematic) It is often easier to distinguish important patterns from a diagram rather than a table, Are more useful to convey information quickly
  • 30. Categorical data Bar chart  Is a diagram in which every category of the variable is represented;  The length of each bar, which should be of constant width, depicts the number or percentage of individuals belonging to that category.  The length of the bar is proportional to the frequency in the relevant category, so it is essential that the scale showing the frequency should start at zero for each bar
  • 31. Figure xx: Prevalence of Prevalence of Dairy Cattle Diseases (a) and ticks (b) in cattle over time period in pastoral region Categorical data b. Tick prevalence a. Prevalence of Dairy Cattle Diseases
  • 32. Pie chart  Is a circle divided into segments with each segment portraying a different category of the qualitative variable.  The total area of the circle represents of the total frequency or percentage, and the area of a given sector is proportional to the percentage of individuals falling into that category.  A pie chart should include a statement of the percentage or actual number of individuals in each segment  Generally, bar chart is preferable to the pie chart as the former is easier to construct and is more useful for comparative purposes, partly because it is easier to compare lengths by eye rather than angles Categorical data
  • 33. Figure xx: Causes of camel calf mortalities in Borana area (hypothetical data). Categorical data Series1, septcemia, 44, 36% Series1, pnuemonia, 27, 22% Series1, diarrhea, 13, 11% Series1, Skin necrosis, 5, 4% Series1, sunken eyes, 14, 11% Series1, pox, 6, 5% Series1, others, 13, 11%
  • 34. Quantitative data  When the data are quantitative, we may use Dot plot, Histogram, Scatter plot, line graph, Box plot, Stem and Leaf Dot diagram If the data set is of a manageable size, the best way of display it is to show every value in a dot diagram/plot Fig xx. Dot diagram of mean daily tick count of different species arround Hawassa
  • 35. Histogram Histogram is a two-dimensional diagram in which usually the horizontal axis represents the units of the measurement of the variable of interest, with each class interval being clearly delineated To construct histogram the data range is divided into 5 to 20 classes or bin to get equal width Range = maximum – minmum value If the intervals are of equal width, then the height of the bin (rectangle) is proportional to the frequency Histogram gives a good picture of the frequency distribution of quantitative variables
  • 36. Histogram… The distribution is symmetrical if its shape to the right of a central value is a mirror image of that to the left of the central value It is used to evaluate normal distribution The tails of the frequency distribution represent the frequencies at the extremes of the distribution The frequency distribution is skewed to the right (positively skewed) if the right-hand tail is extended The frequency distribution skewed to the left (negatively skewed) if the left-hand tail is extended It is common to find biological data which are skewed to the right
  • 37. 0 20 40 60 Frequency 250 300 350 400 450 500 Weight in kg Figure xx: Histogram of weights of 344 dairy cows
  • 38. Figure xx: Line graph showing mean monthly minimum, maximum and average temperature for Borana areas (1976 -2011) Line Graph  Line graphs compare variables, each of which is plotted along x-and-y coordinate.  Show specific values of data, trends in data and enable viewer to predict about.
  • 39. Box- plot The scale of measurement of the variable is usually drawn vertically The diagram comprises a box with horizontal limits defining the upper and the lower quartiles and representing the interquartile range, the central 50% of the observations, with the median marked by a horizontal line within the box The range is as low as the 2.5th percentile and as high as the 97.5th percentile (the minimum and maximum values of the set of observations)
  • 40. 0 20 40 60 80 100 1 2 3 4 5 6 Number of livestock species Fig xx. Box plot showing livestock wealth by species diversity
  • 41. Scatter diagram The scatter diagram is an effective way of presenting data when we are interested in trends and relationship between two variables. The diagram is a two-dimensional plot in which each axis represents the scale of measurement of one of the two variables. Using this rectangular co-ordinate system, we relate the value for an individual on the horizontal scale to the corresponding value for that individual on the vertical scale by marking with an appropriate symbol The points can be joined to produce a line graph, or draw a line which best represents the relationship
  • 42. Fig xx. Relationship of cattle population with rainfall in Borana (between 1976 and 2011)
  • 43. Stem and Leaf Each value is divided into two parts, ‘Stem’ and ‘Leaf’. ‘Stem’ corresponds to higher decimal places, and ‘Leaf’ corresponds to lower decimal places. ‘Stems’ are sorted in ascending order in the first column. The appropriate ‘Leaf’ for each observation is recorded in the row with the appropriate ‘Stem’ Fig xx. A ‘Stem and Leaf’ plot of the weights of calves
  • 44. Numerical measures of description If we are able to determine some form of average that measures the central tendency of the data set, and if we know how widely scattered the observations are in either direction from that average, then we will have a reasonable ‘picture’ of the data.  These two characteristics of a set of observations measured on a numerical variable are known as Measures of location (averages, Central Tendency)  give useful information about the center of the data  Measures of dispersion (spread)  how “spread out” the numbers are abut the center.
  • 45. Measures of location  The tendency of statistical data to get concentrated at certain values is called the “Central Tendency” and  The various methods of determining the actual value at which the data tend to concentrate are called measures of central Tendency or averages.  Hence, an average is a value which tends to sum up or describe the mass of the data.  Measures of central tendency are numbers that tell us where the majority of values in the distribution are located  Common measures of central tendency are Mean, Media and Mode.
  • 46. Measures of location 1. Arithmetic mean  is the most commonly used measure of location.  It is obtained by adding together the observations in a data set and dividing by the number of observations in the set  The mean has the disadvantage that its value is influenced by outliers  An outlier is an observation whose value is highly inconsistent with the main body of the data.  An outlier with an excessively large value will tend to increase the mean unduly, whilst a particularly small value will decrease
  • 47.  Especially it is appropriate to measure location of data if the observations were sampled from symmetrical distributions.  The mean can be misleading if there are any extreme values in a group of numbers.  For example, the mean of the group 1, 2, 3, 2, 4, 5,19 is 5.1. The value 19 is an extreme value, since it is far higher than any of the other numbers in the group. Since only one of the values in the group is actually 5.1 or greater, the mean is not representative of the group.  In this case, the median may provide a better representation.  The mean will be ‘pulled’ to the right (increased in value) if the distribution is skewed to the right, and ‘pulled’ to the left (decreased in value) if the distribution is skewed to the left.
  • 48.  The arithmetic mean of a sample of n numbers y1,y2,..., yn is:  The arithmetic mean for grouped data is:  The arithmetic mean and the median are close or equal in value if the distribution is symmetrical. Geometric mean: It is obtained by taking the nth root of the product of “n” values, i.e, if the values of the observation are demoted by x1,x2 ,…,x n then, GM = n√(x1)(x2)….(xn) . It is preferable to the arithmetic mean if the series of observations contains one or more unusually large values.
  • 49. 2. Median  is the middle of value of the observation when they are arranged in order of magnitude.  It is appropriate for skewed data.  To calculate the median: we have to arrange all of the recorded values in order of size and then find the middle value.  If we arrange the above numbers in numerical order, we obtain: 1, 2, 2, 3, 4, 5, 19. The median is 3.  In the above example, the median is much more representative of the group than the mean (5.1). Extreme values do not affect the median, and the median value is usually typical of the data.
  • 50. 2. Median  If there is an even number of values, use the mean of the two middle values:  For example, for 19, 24, 26, 30, 31, 34, The median is (26 + 30)/2 = 28.  The arithmetic mean and the median are close or equal in value if the distribution is symmetrical.  The advantage of the median is that it is not affected by outliers or if the distribution of the data is skewed. Thus the median will be less than the mean if the data are skewed to the right, and greater than the mean if the data are skewed to the left.
  • 51. 3. Mode  is the most frequently occurring observation; it is determined from the frequencies alone and disregards the values of most of the observations.  It is not affected by extreme values and is most commonly used with skewed data.  It can be determined by creating a frequency table.  Some distributions do not have a mode, whilst other distributions have one mode (unimodal) or more than one mode (e.g. bimodal).  If we arrange the previous numbers in numerical order, we obtain: 1, 2, 2, 3, 4, 5, 19. The mode is 2.
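The three measures for the running example (1, 2, 3, 2, 4, 5, 19) can be checked with Python's standard `statistics` module. This is only an illustrative sketch; the variable names are arbitrary.

```python
import statistics

values = [1, 2, 3, 2, 4, 5, 19]

mean = statistics.mean(values)        # sum of the values divided by n
median = statistics.median(values)    # middle value after sorting
modes = statistics.multimode(values)  # list of all most-frequent values

print(f"mean   = {mean:.1f}")
print(f"median = {median}")
print(f"mode   = {modes}")
```

`multimode` returns a list so that bimodal or multimodal data are reported in full rather than as a single value.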
  • 52.  Although the mean is the measure that is most common, when distributions are asymmetric, the median and mode can give better information about the set of data.  Unusually extreme values in a sample will affect the arithmetic mean more than the median. In that case the median is a more representative measure of central tendency than the arithmetic mean.  For extremely asymmetric distributions the mode is the best measure.
  • 53.  Skewness: If extremely low or extremely high observations are present in a distribution, then the mean tends to shift towards those scores. Based on the type of skewness, distributions can be: a) Negatively skewed distribution: occurs when majority of scores are at the right end of the curve and a few small scores are scattered at the left end. (if it has a long tail to the left) b) Positively skewed distribution: Occurs when the majority of scores are at the left end of the curve and a few extreme large scores are scattered at the right end. (if it has a long tail to the right)
  • 54.  Consider the three distributions shown in Figure  For example, observation of the “No Skew” distribution would yield the following data: 5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 11, 11. Using SPSS software, the following descriptive statistics were obtained for these three distributions
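As a rough check on the “No Skew” data listed above, the mean, median, mode and a skewness coefficient can be computed directly. This sketch assumes the simple moment formula g1 = m3 / m2^1.5 (the helper `moment_skewness` is a hypothetical name). SPSS reports a sample-adjusted coefficient, so its value can differ slightly for skewed data, although both are zero for perfectly symmetric data.

```python
import statistics

no_skew = [5, 5, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 8,
           9, 9, 9, 9, 10, 10, 10, 11, 11]

def moment_skewness(values):
    """Moment-based skewness: third central moment over (second)^1.5."""
    n = len(values)
    centre = sum(values) / n
    m2 = sum((v - centre) ** 2 for v in values) / n
    m3 = sum((v - centre) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

mean = statistics.mean(no_skew)
median = statistics.median(no_skew)
mode = statistics.mode(no_skew)
skew = moment_skewness(no_skew)

print(mean, median, mode, skew)
```

For these data the three measures of central tendency coincide and the skewness coefficient is zero, which is what a symmetric, unimodal distribution should give.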
  • 55. Exercise 1: The following data, 44.4, 67.6, 76.2, 64.7, 80.0, 64.2, 75.0, 34.2, 29.2, represent the infection rates (%) of goats with the viral condition peste des petits ruminants. Calculate the median.  Exercise 2: Calculate the mean and the median of the following data set. What evidence is there for concluding that the data are or are not symmetrically distributed?
  • 56. Answer 1: Arranged in ascending order, the rates (%) are: 29.2, 34.2, 44.4, 64.2, 64.7, 67.6, 75.0, 76.2 and 80.0. There are nine observations, so the median is the (9 + 1)/2 = 5th observation in the ordered set, i.e. the median is 64.7%.  Answer 2: Mean = 761.2/16 = 47.58 g, median = 51.95 g. The mean and the median do not coincide, indicating that the data are skewed. The mean is less than the median, indicating that the data are skewed to the left.
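The rank rule used for the goat data (for an odd number of observations, the median is the (n + 1)/2-th ordered value) can be sketched in Python as follows. The variable names are illustrative only.

```python
import statistics

rates = [44.4, 67.6, 76.2, 64.7, 80.0, 64.2, 75.0, 34.2, 29.2]

ordered = sorted(rates)
n = len(ordered)

# For odd n the median sits at rank (n + 1) / 2, counting from 1
rank = (n + 1) // 2
median_by_rank = ordered[rank - 1]

# Cross-check against the standard library
median_by_library = statistics.median(rates)

print(ordered)
print(f"rank = {rank}, median = {median_by_rank}%")
```

Both routes give the same value, so the library call is a convenient check on a hand calculation.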
  • 57. Which Measure Should You Use?  The choice of a particular measure of central tendency depends on the shape of the population distribution. When we are dealing with sample-based data, the distribution of the data from the sample may suggest the shape of the population distribution.  For normally distributed data, mathematical theory of the normal distribution suggests that the arithmetic mean is the most appropriate measure of central tendency. If a log transformation creates normally distributed data, then the geometric mean is appropriate to the raw data.
  • 58. Which Measure Should You Use?  For symmetric distributions, the mean and median are equal. If the distribution is symmetric and has only one mode, all three measures are the same.  For skewed distributions, with a single mode, the three measures differ.  For positively skewed distributions (where the upper, or right, tail of the distribution is longer (“fatter”) than the lower, or left, tail) the measures are ordered as follows: mode < median < mean.  For negatively skewed distributions (where the lower tail of the distribution is longer than the upper tail), the reverse ordering occurs: mean < median < mode.
  • 59. Which Measure Should You Use? Figure xxx Symmetric (B) and skewed distributions: right skewed (A) and left skewed (C)
  • 60. Measures of Dispersion Consider the following data sets:
            Values                    Mean
  Set 1:    60 40 30 50 60 40 70      50
  Set 2:    50 49 49 51 48 50 53      50
  The two data sets given above have a mean of 50, but obviously set 1 is more “spread out” than set 2. How do we express this numerically? The object of measuring this scatter or dispersion is to obtain a single summary figure which adequately exhibits whether the distribution is compact or spread out.
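One way to express the difference in spread numerically is sketched below in Python. It assumes that the last number in each row of the slide is the stated mean (50), so each set has seven observations; the helper `spread` is a hypothetical name.

```python
import statistics

# Assumption: the trailing 50 in each row of the slide is the mean, not a value
set1 = [60, 40, 30, 50, 60, 40, 70]
set2 = [50, 49, 49, 51, 48, 50, 53]

def spread(values):
    """Return (mean, range, sample standard deviation) for a list of numbers."""
    return (
        statistics.mean(values),
        max(values) - min(values),
        statistics.stdev(values),  # uses the n - 1 divisor
    )

mean1, range1, sd1 = spread(set1)
mean2, range2, sd2 = spread(set2)

print(f"Set 1: mean={mean1}, range={range1}, SD={sd1:.2f}")
print(f"Set 2: mean={mean2}, range={range2}, SD={sd2:.2f}")
```

The means are identical, while both the range and the standard deviation are several times larger for set 1.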
  • 61. Figure shows the frequency polygons for two populations that have equal means but different amounts of variability. Population B, which is more variable than population A, is more spread out. If the values are widely scattered, the dispersion is greater. Figure xx: Two frequency distributions with equal mean but different amount of dispersion
  • 62. Measures of Dispersion  Common measures of variability are the range, variance, standard deviation and coefficient of variation. 1. Range is the difference between the maximum and minimum values in a set of observations.  It wastes information because it uses only the two extreme values and ignores the rest of the data.  It gives undue weight to extreme values and will, therefore, overestimate the dispersion of most of the observations if outliers are present. 2. Variance is the expected squared deviation of a random variable from its mean.
  • 63. The sample variance is: $s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$  The variance is determined by calculating the deviation of each observation from the mean.  This deviation will be large if the observation is far from the mean, and it will be small if the observation is close to the mean. 3. Standard deviation (S) is a measure of the scatter of the observations in relation to their mean, i.e. how close the observations are to their mean.  It gives a measure of dispersion in the original units.  It is the square root of the variance: $s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}$
  • 64. 4. Coefficient of variation (CV): the standard deviation expressed as a percentage of the mean: $CV = \frac{s}{\bar{x}} \times 100$  It can be used for comparing relative amounts of variation. This is especially useful when variability is compared among sets of data that have different units of measurement, or the same unit but very different means.  The standard error of the mean (SEM) is a measure of the precision of the sample mean as an estimate of the population mean: $SEM = \frac{s}{\sqrt{n}}$. It evaluates the sampling error by giving an indication of how close a sample mean is to the population mean it is estimating (inferring), i.e. it is an indication of the reliability of the mean.
  • 65. Confidence interval (Confidence limit)  The confidence interval is the range of values within which the true population mean is expected to lie with a certain probability (e.g. 95%).  Its end points are the lower and upper confidence limits (CL).  If the confidence interval is wide, then the sample mean is an imprecise estimate of the population mean.  If the confidence interval is narrow, then the sample mean is a precise estimate of the population mean.  The 95% confidence interval for the mean is calculated as Mean ± 1.96 × SEM (the standard error of the mean, not the standard deviation).
  • 66. Exercise Calculate the standard deviation, variance and standard error of the following data: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
  • 67.
  Column 1   Column 2        Column 3     Column 4
  xi         xi − x̄          (xi − x̄)²    xi²
  0          0 − 5 = −5      25           0
  1          1 − 5 = −4      16           1
  2          2 − 5 = −3      9            4
  3          3 − 5 = −2      4            9
  4          4 − 5 = −1      1            16
  5          5 − 5 = 0       0            25
  6          6 − 5 = 1       1            36
  7          7 − 5 = 2       4            49
  8          8 − 5 = 3       9            64
  9          9 − 5 = 4       16           81
  10         10 − 5 = 5      25           100
  Sum: 55    0               110          385
  • 68.  Calculate the mean (see the first column, xi): 55/11 = 5  Subtract the mean from each observation to find the deviations from the mean (see the 2nd column, xi − x̄).  Square the deviations from the mean (see the 3rd column, (xi − x̄)²).  Sum the squared deviations (see the 3rd column) = 110  Divide the sum of the squared deviations by n − 1 to find the variance: 110/10 = 11  Take the square root of the variance to calculate the standard deviation: s = √11 ≈ 3.32  SE = s/√n = 3.32/√11 ≈ 1.00 (rounding s to 3.3 first gives 0.9949, which is only a rounding artefact, since √11/√11 = 1 exactly)
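The steps of this worked example can be retraced in Python. The sketch below follows the same columns; the shortcut at the end, which uses the sum of squares from column 4 (Σxi² − (Σxi)²/n), is a standard identity added here for illustration and is not stated on the slide.

```python
import math

data = list(range(11))  # 0, 1, ..., 10
n = len(data)

mean = sum(data) / n                      # 55 / 11
deviations = [x - mean for x in data]     # column 2
squared = [d ** 2 for d in deviations]    # column 3
sum_sq_dev = sum(squared)                 # 110

variance = sum_sq_dev / (n - 1)           # divide by n - 1
sd = math.sqrt(variance)                  # standard deviation
se = sd / math.sqrt(n)                    # standard error of the mean

# Shortcut using column 4: sum of x^2 minus (sum of x)^2 / n
shortcut = sum(x * x for x in data) - sum(data) ** 2 / n

print(f"mean={mean}, variance={variance}, SD={sd:.2f}, SE={se:.2f}")
```

Carrying full precision through the calculation avoids the small discrepancy that appears when the standard deviation is rounded before the standard error is computed.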
  • 69. Exercise The following are progesterone concentrations in the milk (ng/ml) of 14 cows: 4.37, 4.87, 4.35, 3.92, 4.68, 4.54, 5.24, 4.57, 4.59, 4.66, 4.40, 4.73, 4.83, 4.21. Given a variance of 0.10177, calculate the A) Arithmetic mean B) Median C) Standard deviation D) Standard error E) 95% Confidence interval F) Coefficient of variation (CV)
  • 70.  Mean = 63.96/14 = 4.57  Median = 4.58  Variance = (SD)² = (0.319)² = 0.10177  SD = √Variance = √0.10177 = 0.319  SE = 0.319/√14 = 0.0853  CV = (0.319/4.57) × 100 = 6.98%  95% CL = 4.57 ± 1.96 × 0.0853 = [4.40, 4.74]
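The answers for the progesterone exercise can be reproduced with a short Python sketch. It uses the multiplier 1.96 exactly as the slides do; for a sample as small as 14, a t-based multiplier would give a somewhat wider interval, but that refinement is outside what the slides cover.

```python
import math
import statistics

progesterone = [4.37, 4.87, 4.35, 3.92, 4.68, 4.54, 5.24,
                4.57, 4.59, 4.66, 4.40, 4.73, 4.83, 4.21]
n = len(progesterone)

mean = statistics.mean(progesterone)
median = statistics.median(progesterone)
variance = statistics.variance(progesterone)  # n - 1 divisor
sd = math.sqrt(variance)
se = sd / math.sqrt(n)
cv = sd / mean * 100

# 95% confidence interval using the normal multiplier from the slides
lower = mean - 1.96 * se
upper = mean + 1.96 * se

print(f"mean={mean:.2f}  median={median:.2f}  variance={variance:.5f}")
print(f"SD={sd:.3f}  SE={se:.4f}  CV={cv:.2f}%")
print(f"95% CI = [{lower:.2f}, {upper:.2f}]")
```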
  • 71.  Suppose two samples yield the following results:  Which is more variable, the weights of the 25-year-olds or the weights of the 11-year-olds?
  • 72.  A comparison of the standard deviations might lead one to conclude that the two samples possess equal variability.  If we compute the coefficients of variation, however, we have for the 25-year-olds and for the 11-year-olds  If we compare these results, we get quite a different impression. It is clear from this example that variation is much higher in the sample of 11-year-olds than in the sample of 25-year-olds.
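The point of this comparison can be illustrated with a small Python sketch. The means and standard deviations below are hypothetical placeholders, not the figures from the slide (which did not survive extraction); they are chosen only so that the two standard deviations are equal while the means differ.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return sd / mean * 100

# Hypothetical summary statistics (weights in the same unit for both samples)
adults = {"mean": 145.0, "sd": 10.0}    # stand-in for the 25-year-olds
children = {"mean": 80.0, "sd": 10.0}   # stand-in for the 11-year-olds

cv_adults = coefficient_of_variation(adults["sd"], adults["mean"])
cv_children = coefficient_of_variation(children["sd"], children["mean"])

print(f"CV adults   = {cv_adults:.1f}%")
print(f"CV children = {cv_children:.1f}%")
```

With equal standard deviations, the sample with the smaller mean has the larger coefficient of variation, which is the pattern the slide describes.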
  • 73.  Kurtosis is a measure of the degree to which a distribution is “peaked” or flat in comparison to a normal distribution whose graph is characterized by a bell-shaped appearance.  A distribution, in comparison to a normal distribution, may possess an excessive proportion of observations in its tails, so that its graph exhibits a flattened appearance. Such a distribution is said to be platykurtic.  Conversely, a distribution, in comparison to a normal distribution, may possess a smaller proportion of observations in its tails, so that its graph exhibits a more peaked appearance. Such a distribution is said to be leptokurtic.  A normal, or bell-shaped distribution, is said to be mesokurtic.
  • 74.  Consider the three distributions shown in Figure  For example, observation of the “mesokurtic” distribution would yield the following data: 1, 2, 2, 3, 3, 3, 3, 3, … , 9, 9, 9, 9, 9, 10, 10, 11. Using SPSS software, the following descriptive statistics were obtained for these three distributions: