Biostatistics

BIOSTATISTICS
PRESENTED BY,
DR. ANJU MATHEW K
FIRST YEAR MDS
DEPARTMENT OF PERIODONTICS

•Statistics is a very broad subject, with applications in a vast number of different
fields.
•In general, statistics is the methodology for collecting, analyzing, interpreting
and drawing conclusions from information.
•It is the methodology that scientists and mathematicians have developed for
interpreting and drawing conclusions from collected data.

DEFINITION
Statistics consists of a body of methods for collecting and analyzing data. (Agresti &
Finlay, 1997)
Statistics is much more than just the tabulation of numbers and the graphical
presentation of these tabulated numbers.
Statistics is the science of gaining information from numerical and categorical data.
Statistical methods can be used to answer questions such as:
• What kind and how much data need to be collected?
• How should we organize and summarize the data?
• How can we analyse the data and draw conclusions from it?
• How can we assess the strength of the conclusions and evaluate their
uncertainty?

BIOSTATISTICS
•Deals with the statistical methodologies involved in biological
sciences
•As medicine is a branch of biology, medical statistics is a branch of
biostatistics

SAMPLING
•Sampling is the process or technique of selecting a sample of appropriate
characteristics and adequate size
•Sampling is of two types
1.Probability sampling
2.Nonprobability sampling
In PROBABILITY SAMPLING, every member of the population has a known, non-zero
chance of being selected (an equal chance in simple random sampling)
In NONPROBABILITY SAMPLING, samples are collected in a way that does not give
every unit in the population a known chance of being selected

TYPES OF SAMPLING TECHNIQUES
Probability sampling
1.Simple random
2.Stratified random
3.Systematic random
4.Area/cluster sampling
Nonprobability sampling
1.Accidental/convenience
2.Judgement/purposive
3.Network/snowball
4.Quota sampling
5.Dimensional sampling
6.Mixed sampling

Simple random sampling
Every member of the population has an equal chance of being
included in the sample. This type of sampling is used when the
population is homogeneous.
Stratified random sampling
The population is divided into groups called strata on the basis of some
characteristic, not geographically; for example, the population might be
separated into males and females. A random sample is then drawn from each stratum.

Systematic random sampling
Sample members from a larger population are selected
according to a random starting point but with a fixed,
periodic interval. This interval, called the sampling
interval, is calculated by dividing the population size by
the desired sample size.
Area or cluster sampling
Cluster sampling is accomplished by dividing the
population into groups, usually geographically. These
groups are called clusters or blocks. The clusters are
randomly selected, and each element in the selected
clusters is used. For example, in a dental survey in
schools, each section in a class could be used as a
cluster.
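The four probability sampling techniques described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the `seed` parameter and the example population are assumptions made here, not part of the original slides.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Every member has an equal chance of being included."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, n, seed=None):
    """Random starting point, then every k-th member (k = sampling interval)."""
    rng = random.Random(seed)
    k = len(population) // n      # population size divided by desired sample size
    start = rng.randrange(k)      # random start within the first interval
    return [population[start + i * k] for i in range(n)]

def stratified_sample(strata, fraction, seed=None):
    """Draw a simple random sample of the same fraction from each stratum."""
    rng = random.Random(seed)
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly select whole clusters and use every element in them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]

# Hypothetical population of 100 numbered individuals.
population = list(range(1, 101))
print(systematic_sample(population, 10, seed=1))
```

With a population of 100 and a sample of 10, the sampling interval is 10, so consecutive selected members are always 10 apart.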

Accidental or convenience sampling
This sampling is very easy to do and is often used by health
professionals. The researcher examines the people who can be
contacted or accessed. It is inexpensive and
less time consuming.
Judgement or purposive sampling
Researchers rely on their own judgment when
choosing members of the population to participate in
their study.

Network or snowball sampling
A multistage technique. The researcher must first
identify and interview a few subjects with the requisite
criteria. These subjects are then asked to identify
others with the same criteria, who in turn are asked
to identify others, until a satisfactory sample is
obtained.
Quota sampling
Researchers create a sample of individuals
that represents a population, choosing these
individuals according to specific traits or qualities.

Dimensional sampling
An extension of quota sampling. The researcher takes into account several characteristics (e.g.
gender, income, residence and education) and must ensure that there is at least one
person in the study representing each of the chosen characteristics.
Mixed sampling designs
A combination of both probability and nonprobability sampling procedures.

USES OF SAMPLING
•May be the only way to obtain information about a population
•The need to reduce labour and hence cost
•Savings in time, manpower and money

ERRORS IN SAMPLING
•Two types of errors arise in sampling
1.Sampling error
2.Nonsampling error
•Sampling error
Errors that creep in due to the sampling process; they can arise because of
faulty sample design or the small size of the sample
•Nonsampling errors
a) Coverage error: due to non-cooperation of the informant
b) Observational error: due to interviewer's bias, imperfect
experimental technique, or the interaction of both
c) Processing error: due to errors in statistical analysis

DATA
•Data analysis is the cornerstone in reporting research findings
•Data is a set of values of one or more variables recorded on one or
more individuals

TYPES OF DATA
1. Primary data
2. Secondary data

Primary data
Data obtained directly from an individual
ADVANTAGES
1. Precise information
2. Reliable
DISADVANTAGES
1.Time consuming
2.Expensive
Secondary data
Data obtained from outside sources, e.g. hospital records, school registers

VARIABLES
A variable is a state, condition, concept or event whose value is free to vary
within the population
TYPES OF VARIABLES
1.Quantitative
-Discrete
-Continuous
2.Qualitative
-Categorical
-Ordered

METHODS OF COLLECTION OF DATA
1. Questionnaires
2. Surveys
3. Records
4. Interviews

PRESENTATION OF DATA
•Statistical data once collected must be arranged purposively in order
to bring out the important points clearly and strikingly
•The manner in which statistical data is presented is of utmost
importance

METHODS OF PRESENTING DATA
I. Tabulation
Simple tables
Frequency distribution table
II. Charts and diagrams
Bar charts
a. Simple bar chart
b. Multiple bar chart
c. Component bar chart
Histogram
a. Frequency polygon
b. Frequency curve
Pie chart
Pictogram
III. Line diagrams
IV. Statistical maps

TABULATION
•Tables are devices for presenting data
•Tabulation is the first step before the data is used for analysis or interpretation
GENERAL PRINCIPLES FOR DESIGNING TABLES
1.The tables should be numbered, e.g. Table 1, Table 2, etc.
2.A title must be given to each table. The title must be brief and self-explanatory
3.The headings of columns and rows should be clear and concise
4.The data must be presented according to size or importance, chronologically, alphabetically or
geographically
5.If percentages or averages are to be compared, they should be placed as close together as possible
6.No table should be too large
7.Footnotes may be given where necessary, providing explanatory notes or additional
information

FREQUENCY DISTRIBUTION TABLE
The data are first split into convenient groups (class intervals), and the number of
items (frequency) occurring in each group is recorded
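As a rough illustration of how a frequency distribution table is built, the sketch below groups whole-number values into class intervals of equal width and counts the items in each. The function name, the patient ages and the 10-year class width are all hypothetical choices made for this example.

```python
from collections import Counter

def frequency_table(values, width, start=0):
    """Return (lower, upper, frequency) for each class interval.

    Assumes whole-number data, so a class of width 10 starting
    at 20 is labelled 20-29.
    """
    counts = Counter((v - start) // width for v in values)
    table = []
    for index in range(min(counts), max(counts) + 1):
        lower = start + index * width
        table.append((lower, lower + width - 1, counts.get(index, 0)))
    return table

# Hypothetical ages (in years) of 12 patients, grouped into 10-year classes.
ages = [23, 35, 31, 47, 29, 52, 38, 41, 36, 25, 58, 33]
for lower, upper, freq in frequency_table(ages, width=10, start=20):
    print(f"{lower}-{upper}: {freq}")
```

The frequencies across all classes add up to the number of observations, which is a quick check that no value was dropped.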

CHARTS AND DIAGRAMS
•A useful method of presenting simple statistical data
•They have a powerful impact on the imagination of people, so they are a powerful
medium for expressing statistical data
ADVANTAGES
1.Diagrams are better retained in memory than tables
2.If the diagrams are kept simple, the impact on the reader is much higher
DISADVANTAGES
1.Details of the original data may be lost in charts and diagrams

BAR CHARTS
A diagram of columns or bars; the height of each bar represents the value of the
particular data in question
SIMPLE BAR CHART

COMPONENT BAR CHART
When there are two sets of similar information they can be contrasted by
displaying both sets on same graph

HISTOGRAMS
A special sort of bar chart in which the successive
groups of data are linked in a definite
numerical order
Frequency polygon
A frequency distribution may also be
represented diagrammatically by the
frequency polygon.
It is obtained by joining the midpoints of the tops of the
histogram blocks
Frequency curve
The frequency curve for a distribution can be
obtained by drawing a smooth, freehand
curve through these midpoints

PIE CHARTS
Another way of displaying data.
PICTOGRAMS
Data are represented by pictorial symbols

LINE GRAPH
Used when the quantity is a continuous variable
STATISTICAL MAPS
When statistical data refer to geographic or
administrative areas, they are presented either as
shaded maps or dot maps

USES OF DATA
•In designing a health care programme
•In evaluating the effectiveness of an ongoing programme
•In determining the needs of a specific population
•In evaluating the scientific accuracy of a journal article

MEASURES OF CENTRAL TENDENCY
•Central tendency: the value around which the other values are
distributed
•The main objective of a measure of central tendency is to condense the
entire mass of data and to facilitate comparison
•Arithmetic mean
•Median
•Mode

MEAN
•This measure is the arithmetic average, or arithmetic mean
•It is obtained by summing all the observations and dividing by the total number of observations
•Eg: No. of days patients stayed in hospital under Dr. A is: 2,4,3,4,6,6,2,5
•Mean ($\bar{x}$) = Sum of all observations/Number of observations = 32/8 = 4
ADVANTAGES
•Easy to calculate
•Easy to understand
•Utilize entire data
•Amenable to algebraic manipulation
•Affords good comparison
DISADVANTAGES
•The mean is affected by extreme values, in which case it can lead to a misleading interpretation
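The effect of an extreme value on the mean can be seen with a short sketch. The first list is the hospital-stay example above; the second list, in which one patient stays 60 days, is a hypothetical variation added here for illustration.

```python
from statistics import mean

stay_days = [2, 4, 3, 4, 6, 6, 2, 5]        # data from the example above
print(mean(stay_days))                       # 32 / 8

# Hypothetical: the last patient stays 60 days instead of 5.
with_extreme_value = [2, 4, 3, 4, 6, 6, 2, 60]
print(mean(with_extreme_value))              # 87 / 8, well above a typical stay
```

Seven of the eight stays are unchanged, yet the mean more than doubles, which is why the mean alone can mislead when extreme values are present.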

MEDIAN
The data are arranged in ascending or descending order of magnitude and the value of the middle observation is located. With an even number of observations, the median is the average of the two middle values
Eg 1: No. of days patients stayed in hospital under Dr. A is: 2,4,3,4,6,6,2,5
Ascending order: 2,2,3,4,4,5,6,6
Median = (4+4)/2 = 8/2 = 4
Eg 2: No. of days patients stayed in hospital under Dr. A is: 2,4,3,4,6,6,2
Descending order: 6,6,4,4,3,2,2
Median: 4
ADVANTAGES
•It is more representative than the mean when the data are skewed
•It does not depend on every observation
•It is not affected by extreme values
DISADVANTAGES
•Data have to be arranged before calculation. Hence the mean is easier than the median to use as a sample statistic for estimating a population parameter
•It requires more complex statistical procedures than the mean
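The two median examples can be checked with Python's standard `statistics.median`, which sorts the data and handles both the even and the odd case. The third call uses a hypothetical extreme value (60 in place of 5) to show that the median does not move.

```python
from statistics import median

even_count = [2, 4, 3, 4, 6, 6, 2, 5]    # Eg 1: eight observations
odd_count = [2, 4, 3, 4, 6, 6, 2]        # Eg 2: seven observations

print(median(even_count))    # average of the two middle values, 4 and 4
print(median(odd_count))     # the single middle value

# Hypothetical extreme value: replacing 5 with 60 leaves the median unchanged.
print(median([2, 4, 3, 4, 6, 6, 2, 60]))
```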

MODE
Value which occurs with the greatest frequency
Eg 1: No. of days patients stayed in hospital under Dr. A is: 2,4,3,1,6,6,8,5
Mode: 6, i.e. the distribution is unimodal
Eg 2: No. of days patients stayed in hospital under Dr. A is: 2,4,3,4,6,6,8,5
Mode: 6 and 4, i.e. the distribution is bimodal
ADVANTAGES
•It eliminates the effect of extreme variation
•Easily located by mere inspection
•Easy to understand
DISADVANTAGES
•Exact location is uncertain
•It is not exactly defined
•In a small number of cases there may be no mode at all because no value is repeated; for this reason it is seldom used in
medical or biological statistics
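A small sketch with `statistics.multimode` (available from Python 3.8) reproduces the unimodal and bimodal examples. The third list is a hypothetical data set with no repeated value, where every value is returned and the mode is therefore uninformative.

```python
from statistics import multimode

print(multimode([2, 4, 3, 1, 6, 6, 8, 5]))   # Eg 1: one mode
print(multimode([2, 4, 3, 4, 6, 6, 8, 5]))   # Eg 2: two modes
print(multimode([2, 4, 3, 1, 6, 7, 8, 5]))   # hypothetical: no value repeats
```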

MEASURES OF DISPERSION
•Measures of dispersion help to show how widely the observations are spread on
either side of the average
•Dispersion is the degree of spread or variation of the variable about a central
value
•The range
•The mean deviation
•The standard deviation
PURPOSE OF MEASURES OF DISPERSION
•To study the variability of data
•To account for the variability in data

THE RANGE
•The difference between the highest and lowest values in a given sample.
•Range = Xmax - Xmin
ADVANTAGES
•Easy to calculate
DISADVANTAGES
•Unstable
•It is affected by one extremely high or low score
•It is of limited practical importance because it does not indicate anything about the
dispersion of values between the two extreme values

THE MEAN DEVIATION
•It is the average of the deviations from the arithmetic mean, taken without regard to sign
•It is one way of measuring how closely the individual scores in the data set
cluster around the mean. It is given by
•M.D. = $\sum |x - \bar{x}| / n$
•Where $\sum$ (sigma) is the sum of, $x$ is the value of each observation in the data, $\bar{x}$
is the arithmetic mean and $n$ is the number of observations in the data.
•Eg : No. of days patients stayed in hospital under Dr. A is: 2,4,3,4,6,6,2,5
•$\bar{x}$ = 32/8 = 4
•$(x - \bar{x})$ = -2,0,-1,0,2,2,-2,1 ; $\sum (x - \bar{x})$ = -2+0-1+0+2+2-2+1 = 0
•The signed deviations always sum to zero, which will obviously not reflect the degree of dispersion. Taking absolute values gives M.D. = (2+0+1+0+2+2+2+1)/8 = 10/8 = 1.25. Alternatively,
we can square each deviation score

THE VARIANCE
•$(x - \bar{x})^2$ = 4,0,1,0,4,4,4,1 ; $\sum (x - \bar{x})^2$ = 18
•$\sum (x - \bar{x})^2 / n$ = 18/8 = 2.25
•The resulting value is the variance.
•The variance is the average of the squared deviations from the mean of
a set of scores,
•i.e. $\sum (x - \bar{x})^2 / n$

STANDARD DEVIATION
•Most frequently used measure of dispersion
•Defined as the root mean square deviation
•Denoted by the Greek letter sigma ($\sigma$) or by the initials S.D
•S.D is the square root of the variance
•S.D = $\sqrt{\sum (x - \bar{x})^2 / n}$
•Therefore for Dr. A, S.D = $\sqrt{2.25}$ = 1.5
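The measures of dispersion for Dr. A's data can be recomputed in a few lines. This sketch takes the mean deviation with absolute values (the usual definition) and divides the variance by n, as the worked example does; the variable names are chosen here for illustration.

```python
from math import sqrt

stay_days = [2, 4, 3, 4, 6, 6, 2, 5]
n = len(stay_days)
mean = sum(stay_days) / n

data_range = max(stay_days) - min(stay_days)
deviations = [x - mean for x in stay_days]              # signed deviations sum to 0
mean_deviation = sum(abs(d) for d in deviations) / n
variance = sum(d ** 2 for d in deviations) / n          # divides by n, as in the example
std_dev = sqrt(variance)

print(data_range, mean_deviation, variance, std_dev)
```

Python's `statistics.pvariance` and `statistics.pstdev` use the same divisor n, whereas `statistics.variance` and `statistics.stdev` divide by n - 1 and would give slightly larger values for a sample.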

TESTS OF SIGNIFICANCE
•Whenever two sets of observations are to be compared, it becomes
essential to find out whether the difference observed between the two
groups is due to sampling variation or to some other factor
•The methods by which this is done are called tests of significance
1. Standard error test for large samples
2. Chi square test
3. Standard error test for small samples

STANDARD ERROR TEST FOR LARGE SAMPLES
•A sample is considered to be large when it has more than 30
observations
•When the difference between any two large samples in terms of means
or proportions needs to be tested, the formulae used are as follows
•(a). Standard error of mean
•The standard error of the mean gives the standard deviation of the means of
several samples from the same population. It can be
estimated from a single sample.
•Standard error (S.E) of mean = S.D/$\sqrt{n}$

•(b). Standard error (S.E) of proportion = $\sqrt{pq/n}$
•Where $p$ is the proportion of the sample in which the event occurs, $q = 1 - p$ is the proportion in which it does not occur,
and $n$ is the sample size.
•(c). Standard error of difference between two means
•It is used to find out whether the difference between the means of two groups
is significant enough to indicate that the samples represent two different universes.
•Standard error of difference between means = $\sqrt{SD_1^2/n_1 + SD_2^2/n_2}$
•(d). Standard error of difference between proportions
•It is used to find out whether the difference between the proportions of two
groups is significant or has occurred by chance.
•Standard error of difference between proportions = $\sqrt{p_1 q_1/n_1 + p_2 q_2/n_2}$
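The four standard error formulas (a) to (d) translate directly into small functions. This is a sketch only: the function names are invented for this example, and q is taken as 1 - p throughout.

```python
from math import sqrt

def se_mean(sd, n):
    """(a) Standard error of a mean: S.D / sqrt(n)."""
    return sd / sqrt(n)

def se_proportion(p, n):
    """(b) Standard error of a proportion: sqrt(pq / n), with q = 1 - p."""
    return sqrt(p * (1 - p) / n)

def se_diff_means(sd1, n1, sd2, n2):
    """(c) Standard error of the difference between two means."""
    return sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

def se_diff_proportions(p1, n1, p2, n2):
    """(d) Standard error of the difference between two proportions."""
    return sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Hypothetical example: S.D of 1.5 days observed in a sample of 36 patients.
print(se_mean(1.5, 36))
```

In each formula the sample size sits in the denominator, so the standard error shrinks as the sample grows.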

CHI SQUARE TEST
It is an alternative method of testing the significance of the difference between two proportions
Eg: Suppose there are two groups, one of which has received oral hygiene instructions while the other has not, and
it is desired to test whether the occurrence of new cavities is associated with the instructions.
STEPS
1. State the null hypothesis
Set up a null hypothesis that “there is no difference between the two” and then proceed to test it.
•Here we state the null hypothesis as ‘there is no association between receiving oral hygiene instructions
and the occurrence of new cavities’
Group                                      Occurrence of new cavities
                                           Present   Absent   Total
Number who received instructions              10       40       50
Number who did not receive instructions       32        8       40
Total                                         42       48       90

2. Then the $\chi^2$ statistic is calculated as
$\chi^2 = \sum (O - E)^2 / E$
Where O is the observed frequency and E is the expected frequency
Expected frequency (E) = Row total * Column total/Grand total
Among those who received instructions
Expected number attacked = 42*50/90 = 23.3
Expected number not attacked = 48*50/90 = 26.7
Among those who did not receive instructions
Expected number attacked = 42*40/90 = 18.7
Expected number not attacked = 48*40/90 = 21.3
Group                                      Attacked                           Not attacked
Number who received instructions           O = 10, E = 23.3, O – E = -13.3    O = 40, E = 26.7, O – E = 13.3
Number who did not receive instructions    O = 32, E = 18.7, O – E = 13.3     O = 8, E = 21.3, O – E = -13.3

3. Applying the $\chi^2$ test,
$\chi^2 = \sum (O - E)^2 / E$
$= (-13.3)^2/23.3 + (13.3)^2/26.7 + (13.3)^2/18.7 + (-13.3)^2/21.3$
= 7.62 + 6.67 + 9.52 + 8.33 = 32.1 (using unrounded expected frequencies)
4. Finding the degrees of freedom (d.f)
It depends on the number of columns and rows in the original table
d.f = (c-1)*(r-1)
Where c = number of columns ; r = number of rows
d.f = (2 – 1)*(2 – 1) = 1

5. Probability tables
The conclusion is drawn depending upon the value of “P”.
•In the probability table, with 1 degree of freedom, the $\chi^2$ value for a probability (P) of 0.05 is 3.84. Since the
calculated value is much higher, the null hypothesis is rejected: there is a difference in caries
occurrence between the two groups, with caries being lower in those who received instructions.
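The chi-square steps can be retraced in code from the 2 x 2 table of observed counts. This is a minimal sketch in plain Python with variable names chosen here; it keeps the expected frequencies unrounded, which gives a statistic of about 32.1, and uses the table value 3.84 quoted above for 1 degree of freedom.

```python
# Observed counts: rows = instructions received / not received,
# columns = new cavities present / absent.
observed = [[10, 40],
            [32, 8]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total   # expected frequency
        chi_square += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)          # (r - 1) * (c - 1)
CRITICAL_VALUE = 3.84    # chi-square table value for P = 0.05, d.f. = 1

print(round(chi_square, 2), df)
print("reject null hypothesis:", chi_square > CRITICAL_VALUE)
```

Because the loop works from row and column totals, the same code applies to larger tables, although the critical value must then be looked up for the new degrees of freedom.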

Z test
It is used to test the significance of the difference in means for large samples (>30)
The prerequisites for applying the Z test for means are,
1. The sample must be randomly selected
2. The data must be quantitative
3. The variable is assumed to follow a normal distribution in the population
4. The sample size should be larger than 30
Z = (Observation – mean) / Standard deviation
= $(x - \bar{x})$ / SD
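The ratio above can be sketched as a one-line function. The numbers used are illustrative only: the mean of 4 and S.D of 1.5 come from the hospital-stay example, and the 7-day stay is a hypothetical observation.

```python
def z_score(observation, mean, sd):
    """Number of standard deviations an observation lies from the mean."""
    return (observation - mean) / sd

# Hypothetical observation: a 7-day stay, with mean 4 and S.D 1.5.
print(z_score(7, 4, 1.5))
```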

STANDARD ERROR TEST FOR SMALL SAMPLES
•A sample is considered to be small if it has fewer than 30 observations.
•The test applied is called the ‘t’ test
•Designed by W. S. Gosset, whose pen name was ‘Student’. Hence this test is
called Student’s t-test
•When the investigation compares observations carried out on the
same individuals, say before and after a certain experiment, the comparison is
called a paired comparison
•When the observations are carried out in two independent samples and their values
are compared, it is known as an unpaired comparison

CRITERIA FOR APPLYING ‘t’ TEST
•The sample must be randomly selected
•The data must be quantitative
•The variable is assumed to follow a normal distribution in population
•The sample size should be less than 30

t-TEST FOR PAIRED COMPARISON
1. As per the null hypothesis, assume that there is no real difference between the means of
the two sets of readings
2. The difference between the before and after readings is calculated for
each individual
3. The mean and standard deviation (SD) of these differences are calculated
4. The standard error of this mean difference is calculated by the formula SE = SD/$\sqrt{n}$
5. t is calculated by the formula, t = Mean difference / Standard error of the difference
6. Find the degrees of freedom (df) = (n-1), where n is the number of pairs of observations
7. From the t-distribution table, the probability of t corresponding to (n-1) degrees
of freedom is noted
8. If the probability is more than 0.05, the difference observed has no significance, because it can
be due to chance
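Steps 2 to 6 of the paired comparison can be followed in a short sketch. The before and after scores are hypothetical values invented for this example, and the standard deviation is taken as the sample S.D (n - 1 in the denominator), which is an assumption since the steps do not say which divisor to use.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical plaque scores for 6 patients before and after an intervention.
before = [2.4, 2.8, 3.1, 2.6, 3.0, 2.7]
after = [2.0, 2.5, 2.6, 2.4, 2.5, 2.3]

differences = [b - a for b, a in zip(before, after)]   # step 2
n = len(differences)
mean_diff = mean(differences)                           # step 3
sd_diff = stdev(differences)                            # step 3 (sample S.D, assumed)
se = sd_diff / sqrt(n)                                  # step 4
t = mean_diff / se                                      # step 5
df = n - 1                                              # step 6

print(round(t, 2), df)
```

The resulting t is then compared with the t-table value for 5 degrees of freedom (2.571 at P = 0.05, two-tailed); a larger calculated t means the probability is below 0.05.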

The unpaired ‘t’ test
1. As per the null hypothesis, assume that there is no real difference between the means of the two
samples.
2. Find the observed difference between the means of the two samples ($\bar{X}_1 - \bar{X}_2$)
3. Calculate the standard error of the difference between the two means.
SE = $S \sqrt{1/n_1 + 1/n_2}$, where $S$ is the pooled standard deviation of the two samples
4. Calculate the ‘t’ value
t = $(\bar{X}_1 - \bar{X}_2)$ / SE
5. Determine the pooled degrees of freedom from the formula
d.f = (n1 – 1) + (n2 – 1) = n1 + n2 - 2
6. Compare the calculated value with the table value (table of ‘t’) at the particular degrees of freedom to find the level of
significance.
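The unpaired comparison can be sketched in the same way. The sketch assumes the usual pooled-variance form of the standard error, in which the pooled S.D multiplies the square root of (1/n1 + 1/n2); the function name and the two groups of scores are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def unpaired_t(sample1, sample2):
    """Return (t, degrees of freedom) for two independent samples.

    Assumes equal variances and pools the two sample variances.
    """
    n1, n2 = len(sample1), len(sample2)
    df = n1 + n2 - 2
    pooled_variance = ((n1 - 1) * variance(sample1)
                       + (n2 - 1) * variance(sample2)) / df
    se = sqrt(pooled_variance) * sqrt(1 / n1 + 1 / n2)
    t = (mean(sample1) - mean(sample2)) / se
    return t, df

# Hypothetical scores from two independent groups of five patients.
group_a = [5, 7, 6, 8, 9]
group_b = [3, 4, 5, 4, 4]
t, df = unpaired_t(group_a, group_b)
print(round(t, 2), df)
```

The calculated t is then compared with the table value at n1 + n2 - 2 degrees of freedom, as in step 6.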

CONCLUSION
•Biostatistical techniques help to establish that the results found in a
study are not merely due to chance.
•Statistics plays a major role in obtaining better
and more accurate results.
•A well designed and properly conducted study is a basic prerequisite
to arrive at valid conclusions.

REFERENCES
Soben Peter. Essentials of Public Health Dentistry, 5th edition
K. Park. Park's Textbook of Preventive and Social Medicine, 19th edition
Joseph John. Textbook of Preventive and Community Dentistry, 2nd edition
Richard Levin & David S. Rubin. Statistics for Management, 6th edition
