2. INTRODUCTION
Statistics may be defined as the discipline
concerned with the treatment of numerical data
derived from a group of individuals.
Biostatistics is the branch of statistics applied to
biological or medical sciences.
It consists of various steps such as generation of a
hypothesis, collection of data, and application of
statistical analysis.
3. Two major branches: descriptive and inferential.
Descriptive statistics explain the distribution of
population measurements by providing types of data,
estimates of central tendency (mean, mode and
median), and measures of variability (range,
standard deviation).
Inferential statistics is used to express the level of
certainty about estimates and includes hypothesis
testing, standard error of mean, and confidence
interval.
4. DATA
Observations recorded during research constitute
data.
The two main types of data:
Qualitative
Quantitative
Most studies will have a combination of both.
5. Qualitative Data/Categorical
These are variables that do not have a numerical value.
They usually describe a meaning and give a
name or label to variables.
2 types: nominal and ordinal
6. Quantitative Data
These are variables that are truly numerical. Quantitative
data (interval data) may be discrete or continuous.
A continuous variable can take any value within a given
range.
Eg: hemoglobin (Hb) level may be recorded as 11.3, 12.6 or
13.4 gm%.
A discrete variable is usually assigned integer values, i.e.
it does not have fractional values.
Eg: number of cigarettes smoked per day by a person, or
blood pressure values, which are generally recorded as
whole numbers.
7. DATA COLLECTION
Sample refers to the subjects chosen from the
population for investigation. It should be ensured that
the sample is representative of the whole population
Need for sampling:
It is easier and more economical than studying the
whole population.
It saves time, manpower and cost, and increases efficiency.
If the sample size is too small, it will not give valid results.
If it is too large, it needs more money and manpower.
8. Methods of sampling
Random sampling / probability sampling:
Simple random sampling
Stratified random sampling
Systematic random sampling
Multistage sampling
Multiphase sampling
Cluster random sampling
Non-random / non-probability sampling:
Convenience sampling
Contact sampling
Quota sampling
Volunteer sampling
Snowball sampling
9. Random sampling
Simple random: chosen randomly, entirely by chance.
Each individual has the same probability of being chosen.
Methods: lottery, random number tables, computer
generated.
Systematic random: individuals in the population are
arranged in a certain order, a random starting
point is selected, and every nth individual is selected.
“n” is the sampling interval, i.e. total number of units in
the population / total number of units in the sample.
Some individuals may have a larger probability of being
chosen.
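The systematic procedure above can be sketched in a few lines of Python. This is only an illustration: the function name, the seed and the population of unit IDs 1 to 1000 are assumptions made for the example, not part of the slides.

```python
import random

def systematic_sample(population, sample_size, seed=None):
    """Pick every k-th unit after a random start, where k = N // n."""
    k = len(population) // sample_size    # sampling interval
    rng = random.Random(seed)
    start = rng.randrange(k)              # random start inside the first interval
    return population[start::k][:sample_size]

# Hypothetical population: unit IDs 1..1000, sample of 100 -> interval of 10
units = list(range(1, 1001))
sample = systematic_sample(units, 100, seed=1)
print(sample[:5])
```

Only the starting point is random; every later unit follows from it, which is why the ordering of the population matters.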
10. Stratified random: initially the whole population is stratified
(a non-homogeneous population is converted into
homogeneous groups) and then simple or systematic random
sampling is applied to each stratum.
Eg: for a sample of 100 from a population of 1000
(heterogeneous),
first divide it into homogeneous strata (i.e. 700 males, 300
females),
then select 70 males and 30 females randomly.
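The 700/300 example can be reproduced with a short sketch. It assumes proportional allocation followed by simple random sampling inside each stratum; the function name and the subject labels are hypothetical.

```python
import random

def stratified_sample(strata, sample_size, seed=None):
    """Proportional allocation, then simple random sampling within each stratum."""
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    chosen = {}
    for name, members in strata.items():
        quota = sample_size * len(members) // total   # share proportional to stratum size
        chosen[name] = rng.sample(members, quota)
    return chosen

# Hypothetical population of 1000: 700 males and 300 females
strata = {
    "male": [f"M{i}" for i in range(700)],
    "female": [f"F{i}" for i in range(300)],
}
picked = stratified_sample(strata, 100, seed=1)
print({name: len(members) for name, members in picked.items()})
# prints {'male': 70, 'female': 30}
```

The integer division rounds each quota down, so with stratum sizes that do not divide evenly the quotas may add up to slightly less than the requested sample size.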
Multistage sampling: done in successive stages. Each
sampling unit is nested in the previous sampling unit.
Eg: in large country surveys, states are chosen, then
districts, then every 10th person as the final sampling unit.
11. Multiphase sampling: done in successive phases, i.e. part
of the information is obtained from the whole sample and
part from a subsample.
Eg: in a TB survey, the Mantoux test is done in the first phase,
then X-ray is done in all Mantoux positives, then sputum is
tested in all X-ray positives.
Cluster sampling: applicable when units of the population
are natural groups or clusters. All individuals in the
cluster are selected as a whole.
12. Non random sampling
Selection based on expert knowledge of the population.
Cannot be assured that each item has equal chances of
being selected.
Convenience sampling: subjects are selected at the
convenience of the researcher. Eg: selecting shoppers in a
mall as they walk by to fill out a survey.
Quota sampling: the population is segmented into mutually
exclusive groups, then judgement is used to select units.
Snowball sampling: existing study subjects recruit future
subjects from their acquaintances, thus the sample group
appears to grow like a snowball. Used for hidden
populations which are difficult to access, like drug abusers
and commercial sex workers.
13. DATA PRESENTATION
Raw data can be presented in three different ways:
tabular,
graphic (chart),
numerical (descriptive statistics) forms.
14. Tabular form
Frequency, cumulative frequency, relative
frequency tables
These methods can be used to present all
different types of variables including nominal,
ordinal and quantitative data. In order to present
continuous data by this mode it needs to be
arranged into groups (intervals) first.
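A minimal sketch of such a table, assuming hypothetical Hb readings that are grouped into 1 gm% intervals before counting (the values and the interval width are invented for illustration):

```python
from collections import Counter
from math import floor

def frequency_table(values, width=1):
    """Rows of (interval start, frequency, cumulative frequency, relative frequency)."""
    # Group each continuous value into the interval [start, start + width)
    counts = Counter(floor(v / width) * width for v in values)
    n = len(values)
    rows, running = [], 0
    for start in sorted(counts):
        running += counts[start]
        rows.append((start, counts[start], running, counts[start] / n))
    return rows

# Hypothetical Hb levels (gm%) of 10 subjects
hb = [10.2, 11.3, 11.8, 12.6, 12.9, 13.4, 13.7, 14.1, 11.1, 12.2]
table = frequency_table(hb)
for row in table:
    print(row)
```

The last cumulative frequency always equals the sample size, and the relative frequencies add up to 1.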
16. Graphic (Chart) Presentation
Pie chart
Bar chart
Histograms
Frequency curves
Cumulative frequency curves ( ogive )
Scatter plots
17. Pie chart
Useful to show the proportion of different groups that
constitute the total sample.
The whole pie represents the total sample, while the size
occupied by each group is proportional to its number.
Used for ordinal and nominal data.
18. Bar charts
Bar charts are used to compare different classes of data.
The x axis is usually dimensionless while the y axis
represents the frequency of each class.
Each class could represent a single group, or be further
divided into subgroups.
19. HISTOGRAMS
A specialized bar chart used to give a visual presentation
of interval data.
Quantitative data, and in particular continuous data, are
divided into intervals in order to be integrated into
frequency tables.
USES:
To show the mode of distribution of the data.
To demonstrate descriptive statistics like mean, mode and
standard deviation.
20. Frequency Curves
• These are very similar to histograms, but without the bars.
• The advantage is that they can be used to compare the
distribution of 2 or more groups on the same chart.
21. Ogive
Data values represented on the horizontal axis
and either the cumulative frequencies, the
cumulative relative frequencies or cumulative
percent frequencies on the vertical axis.
This type of graph is useful to identify the
proportion of a sample that falls below or above
certain limit.
22. Scatter plots
• These are used to determine if there is any relationship
between two sample variables.
• The strength of the relationship can be calculated using a
correlation coefficient.
23. Numerical Presentation (Descriptive Statistics)
Main aim: present a meaningful summary of the
sample data rather than drawing conclusions
about the whole population
Three key characteristics are distribution, central
tendency and measurements of spread.
2 types of frequency distribution:
Normal/ Gaussian
Non normal/ non Gaussian
24. Gaussian /normal distribution/ symmetrical
If data is symmetrically distributed on both sides of mean and
form a bell-shaped curve in frequency distribution plot, the
distribution of data is called normal or Gaussian.
The normal curve describes the ideal distribution of
continuous values, e.g. heart rate, blood sugar level and Hb %
level.
The normal (parametric) distribution is characterized by a
single peak (unimodal) and a symmetrical spread of values
on either side.
All central tendency measures (mean, mode and median) are
equal in a normal distribution and they are represented by
the point of maximum frequency. The spread of data is equal
on either side of the mean and is described by the standard
deviation (SD).
25. In an ideal Gaussian distribution, the values lying
between the points 1 SD below and 1 SD above the mean
value (i.e. ± 1 SD) will include 68.27% of all values.
The range mean ± 2 SD includes approximately 95% of
values distributed about this mean, excluding 2.5% above
and 2.5% below the range.
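These percentages can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `share_within` is an assumption for the example.

```python
from statistics import NormalDist

def share_within(k_sd):
    """Fraction of a normal distribution lying within mean ± k_sd standard deviations."""
    z = NormalDist()                  # standard normal: mean 0, SD 1
    return z.cdf(k_sd) - z.cdf(-k_sd)

for k in (1, 1.96, 2, 3):
    print(k, round(share_within(k) * 100, 2))
```

Mean ± 2 SD covers a little more than 95% of values; the exact 95% limits lie at about mean ± 1.96 SD.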
Methods of analysis :‘t’
test and analysis of
Mean = median = mode
Total area under the curve = 1
For the standard normal curve, mean = 0 and SD = variance = 1
Skew is zero.
27. If the difference (mean − median) is positive, the curve
is positively skewed, and if (mean − median) is
negative, the curve is negatively skewed; in a skewed
distribution the measures of central tendency therefore differ.
Measures of skewness: Karl Pearson’s measure,
Bowley’s measure, Kelly’s measure, moment
measure
29. MEASURES OF CENTRAL
TENDENCY
An estimate of the "center" of a distribution of values.
The three central tendency measures are mean, median
and mode
Mean
Total sum of the values divided by the number of
observations (arithmetic mean).
Used for parametric data and should not be used to
report central tendency of ordinal or nominal data.
Eg: Suppose the heights of 7 children are 60, 70, 80, 90, 90,
100 and 110 cm. Mean = Σx/n = 600/7 = 85.71 cm.
It is the measure most affected by outliers (extreme values).
30. Median is the middle value when all the data are
arranged in numerical order. This means that 50% of
the data are below and 50% above that value.
It is preferable for measuring central tendency in
nonparametric data since it is less affected by outliers
than the mean.
Mode is the most frequently occurring observation in a
set of data. It is not a good indicator of central tendency,
but it is the only measure of central tendency available for
nominal data (the median can also be used for ordinal data).
Least affected by outliers.
In a moderately skewed distribution, mode ≈ 3 median − 2 mean
(an empirical relationship).
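The three measures can be computed for the heights of the 7 children with Python's standard `statistics` module. The extra value of 200 cm is a hypothetical outlier, added only to show how each measure reacts.

```python
import statistics

heights_cm = [60, 70, 80, 90, 90, 100, 110]

mean = statistics.mean(heights_cm)       # 600 / 7
median = statistics.median(heights_cm)   # middle value of the ordered data
mode = statistics.mode(heights_cm)       # most frequent value
print(round(mean, 2), median, mode)

# Hypothetical outlier: the mean shifts, the median and mode do not
with_outlier = heights_cm + [200]
print(statistics.mean(with_outlier),
      statistics.median(with_outlier),
      statistics.mode(with_outlier))
```

With the outlier the mean rises from 85.71 to 100, while the median and mode both stay at 90.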
31. Measures of Spread/dispersion
Absolute (have units):
Range
Mean deviation
Standard deviation
Quartile deviation
Relative (no units):
Coefficient of range
Coefficient of quartile deviation
Coefficient of mean deviation
Coefficient of variation = (SD/mean) × 100
32. Measures of Spread/dispersion
Range is the simplest measure of spread, but with limited
practical use. It is the difference between the maximum and the
minimum value in a data set.
Variance is the sum of the squared differences
of each value from the mean, divided by the number of
observations.
The SD is the square root of the variance. Standard deviation
(SD) describes the variability of the observations about the mean.
Also called the root-mean-square deviation.
When the SD is estimated from a sample, the denominator is (n − 1)
instead of n; this matters most when the sample size is small (< 30).
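The effect of the denominator can be seen by computing the variance and SD both ways. The data below are hypothetical; `pvariance`/`pstdev` divide by n, while `variance`/`stdev` divide by (n − 1).

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]            # hypothetical observations, mean = 5

pop_var = statistics.pvariance(values)       # sum of squared deviations / n
pop_sd = statistics.pstdev(values)
sample_var = statistics.variance(values)     # sum of squared deviations / (n - 1)
sample_sd = statistics.stdev(values)

# Coefficient of variation = (SD / mean) * 100, a unit-free relative measure
cv = sample_sd / statistics.mean(values) * 100

print(pop_var, pop_sd)
print(round(sample_var, 3), round(sample_sd, 3), round(cv, 1))
```

The (n − 1) version is always slightly larger, and the gap shrinks as the sample size grows.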
33. Percentiles are the main measures of spread for
non-parametric data.
Eg: tertiles (3 equal parts), quartiles, pentiles, hextiles,
heptiles, octiles, deciles, centiles (100 equal parts)
Quartiles divide the ordered data into 4 equal parts:
the 1st quartile has 25% of the data below it, the 2nd
quartile corresponds to the median and has 50% of data
below it, and the 3rd quartile has 75% of data below it.
Eg: Percentiles are used in the WHO growth chart. The upper
reference curve is the 50th centile for boys and the lower reference
curve is the 3rd centile for girls. The road to health is the space
between these 2 curves, which indicates normality (95% of
healthy normal children fall in this area).
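Quartiles can be obtained with `statistics.quantiles` from the standard library. The data are hypothetical, and the `inclusive` method is one of several conventions for interpolating between observations, so other software may give slightly different cut points.

```python
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8, 9]     # hypothetical ordered observations

# n=4 asks for the three cut points that split the data into 4 equal parts
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

iqr = q3 - q1                  # interquartile range
quartile_deviation = iqr / 2   # semi-interquartile range
print(q1, q2, q3, iqr, quartile_deviation)
```

Here the 2nd quartile equals the median, as described above.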
34. Standard error of mean
Since we study some patients (sample) to draw conclusions
about all patients or population and use the sample mean (M) as
an estimate of the population mean (M1) , we need to know how
far M can vary from M1 if repeated samples of size N are taken.
Standard error of mean SEM = SD/√n
SEM is always less than SD.
Measure of difference between sample and population values.
Uses:
Determine limits of confidence within which the mean would lie
Determine if a sample is drawn from a known population or not
Calculate sample size
35. For example, take the fasting blood sugar of 200 lawyers.
Suppose the mean is 90 mg% and SD = 8 mg%.
SEM = SD/√n = 8/√200 = 8/14.14 = 0.566.
Mean fasting blood sugar + 2 SEM = 90 + (2 x 0.566) = 91.13
Mean fasting blood sugar - 2 SEM = 90 - (2 x 0.566) = 88.87
So, the confidence limits of the mean fasting blood sugar of the
lawyer population are 88.87 to 91.13 mg%. If the mean fasting blood
sugar of another sample is 80 mg%, we can say that it is not
drawn from the same population.
Confidence Interval (CI) or fiducial limits: confidence
limits are the two extremes of a range within which
the population mean would lie with a stated confidence (usually 95%).
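The lawyer example can be recomputed without intermediate rounding. The function names are assumptions for this sketch, and the multiplier of 2 follows the slide (the exact 95% multiplier for large samples is about 1.96).

```python
from math import sqrt

def standard_error(sd, n):
    """Standard error of the mean: SD / sqrt(n)."""
    return sd / sqrt(n)

def confidence_limits(mean, sd, n, multiplier=2):
    """Lower and upper limits: mean -/+ multiplier * SEM."""
    margin = multiplier * standard_error(sd, n)
    return mean - margin, mean + margin

sem = standard_error(8, 200)
low, high = confidence_limits(90, 8, 200)
print(round(sem, 3), round(low, 2), round(high, 2))
```

Carrying full precision gives limits that can differ in the second decimal place from a hand calculation that rounds the SEM first.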
37. Correlation
Measure of degree of linear relationship between two
continuous variables. It is represented by ‘r’.
The association is positive if the values of x and y
tend to be high or low together.
The association is negative if high y values
tend to go with low x values; r = -1 is considered a
perfect negative correlation.
The larger the absolute value of the correlation coefficient,
the stronger the association.
EG: correlation between height and weight, age and height,
weight loss and poverty, parity and birth weight,
socioeconomic status and hemoglobin.
38. The correlation coefficient values are always between -1
and +1. If the variables are not correlated, then
correlation coefficient is zero
Correlation is represented by scatter diagram.
Pearson coefficient : for Gaussian distribution
Spearman coefficient: for non Gaussian distribution
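Both coefficients can be sketched in plain Python. The height and weight values are hypothetical, and the simple ranking helper assumes there are no tied values; Spearman's coefficient is then Pearson's coefficient applied to the ranks.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank 1 for the smallest value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(x, y):
    """Spearman coefficient: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical heights (cm) and weights (kg) of 5 subjects
heights = [150, 155, 160, 165, 170]
weights = [50, 54, 59, 61, 68]
r = pearson_r(heights, weights)
rho = spearman_rho(heights, weights)
print(round(r, 3), round(rho, 3))
```

Because weight rises with every increase in height in this made-up data, the rank correlation is exactly 1 even though the points do not lie on a perfect straight line.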
39. Regression
Describes the structure of the relationship between 2 quantitative
variables.
Regression coefficient (b): measures the change in a dependent
variable (y) with a change in the independent variable (x) or variables
(x1, x2, x3)
Types of regression:
simple linear (1 dependent and 1 independent variable)
multiple linear (1 dependent and more than 1 independent
variable)
simple curvilinear (1 dependent and 1 independent variable with
some power of the independent variable)
multiple curvilinear (1 dependent and more than 1 independent
variable with some power of the independent variable)
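For the simple linear case, the regression coefficient can be estimated by least squares. This sketch assumes the model y = a + b·x; the function name and data are hypothetical.

```python
def simple_linear_regression(x, y):
    """Least-squares intercept a and slope b for the line y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx          # regression coefficient: change in y per unit change in x
    a = my - b * mx        # intercept: predicted y when x = 0
    return a, b

# Hypothetical data lying exactly on the line y = 1 + 2x
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
a, b = simple_linear_regression(x, y)
print(a, b)
```

Unlike the correlation coefficient, b carries units (units of y per unit of x) and is not restricted to the range -1 to +1.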
40. Null Hypothesis
The primary object of statistical analysis is to find out
whether the effect produced by a compound under study is
genuine and is not due to chance.
First step in such a test is to state the null hypothesis.
In null hypothesis (statistical hypothesis), we make
assumption that there exist no differences between the two
groups.
Eg: ‘drug A is not better than the placebo’
Alternative hypothesis (research hypothesis) states that
there is a difference between two groups.
Eg: ‘there is a difference between new drug ‘A’ and placebo.’
41. When the null hypothesis is not rejected, the difference
between the two groups is not statistically significant.
If the null hypothesis is rejected in favour of the alternative
hypothesis, then the difference between the two groups is
statistically significant.
A difference between the drug ‘A’ and placebo groups
that would have arisen by chance in less than five
percent of cases, that is, less than 1 in 20 times, is
considered statistically significant (P < 0.05).
43. Errors in estimation
Random error:
Error in measurement, i.e. measured values are inconsistent
when repeated measures of a variable are taken
Unpredictable
Considered as ‘noise’
Precision is the opposite
Doesn’t affect the average, but affects variability around the mean
Systematic error:
Caused by any factor which systematically affects
measurements of a variable
Affects the mean
Called bias
Opposite is accuracy (validity)
44. Precision: degree to which repeated measurements
show same or similar results
Also called repeatability, reliability, consistency,
reproducibility
Accuracy: degree of closeness of a measured value to
its actual/true value
45. Bias
Any systematic error in an epidemiological study
occurring during data collection, compilation, analysis,
or interpretation.
Predominantly 3 types:
1. Subject bias, eg: Hawthorne bias, recall bias
2. Observer/investigator bias, eg: selection bias,
Berksonian bias
3. Analyser bias
46. Hawthorne/ attention bias: subjects may alter their
behavior when they know they are being observed.
Apprehension bias: certain variables ( BP, Heart rate)
may alter from usual levels if subject is apprehensive
Berksonian bias / admission rate bias: bias due to
hospital cases and controls being systematically
different from each other
47. Selection bias: Selection bias occurs as a result of
patients declining to take part in a clinical trial and
therefore those who do take part may differ in some way.
Publication bias: Studies with positive or statistically
significant results are more likely to be published by
scientific journals compared with studies yielding
negative results.
48. Measures to minimise bias
Blinding
- Single blinding eliminates subject bias
- Double blinding eliminates subject and observer bias
- Triple blinding eliminates subject, observer and analyzer
bias
Randomization – eliminates selection bias
Randomization ensures that the two groups are comparable
and that the only difference between them is the intervention
of interest.
Matching – eliminates confounding
49. Errors in analysis
Type I error ( false rejection of null hypothesis, FALSE
POSITIVE)
Also known as α error.
It is the probability of finding a difference when no such
difference actually exists.
Type I error can be made small by choosing a stricter level of
significance.
Eg: suppose we concluded from our trial that new drug ‘A’ has an
analgesic action and it was accepted as an analgesic. If we committed
a type I error in this experiment, then subsequent trials on this
compound are likely to contradict our claim that drug ‘A’ has
analgesic action.
50. The P value is the probability of obtaining a difference at least
as large as the one observed if the null hypothesis is true.
Significance level or α level is the maximum tolerable
probability of type 1 error. The α level is fixed in advance.
If P value < significance level, results are declared
statistically significant.
Most commonly, a P value less than 0.05 (5%) is
considered significant. If we adopt a
stricter standard like P < 0.01 (1%), then type 1 error
will be reduced.
51. Type II Error (false acceptance of null hypothesis, FALSE
NEGATIVE)
This is also called as β error.
It is the probability of inability to detect the difference when
it actually exists.
This error is more serious because once we labelled the
compound as inactive, there is possibility that nobody will
try it again.
Minimized by taking larger sample and by employing
sufficient dose of the compound under trial.
Most medical research will accept a β value of 0.2
Study power is the probability that it will detect a statistically
significant difference if one exists. It is calculated as (1-β).
Acceptable power of a study: 0.8
52. Other measures of probability
Odds ratio
This is used to measure the effect of a certain exposure or
intervention on the odds of an event happening.
An odds ratio of 1 means there is no difference
between the 2 groups.
In a case control study, OR is calculated from the 2 by 2
table, where a and b are the exposed subjects with and without
the disease, and c and d are the unexposed subjects with and
without the disease.
OR (cross product ratio) = ad/bc
Interpretation of OR: >1 – associated, =1 – not associated,
<1 – has protective effect
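The cross product ratio is easy to compute once the 2 by 2 table is laid out. The cell labels used here (a = exposed with disease, b = exposed without, c = unexposed with disease, d = unexposed without) are the usual convention and are an assumption of this sketch; the counts are hypothetical.

```python
def odds_ratio(a, b, c, d):
    """Cross product ratio ad / bc from a 2 by 2 table."""
    return (a * d) / (b * c)

def interpret(or_value):
    """Reading of the odds ratio as given in the slide."""
    if or_value > 1:
        return "associated"
    if or_value < 1:
        return "protective effect"
    return "not associated"

# Hypothetical counts: 30 exposed cases, 70 exposed controls,
# 10 unexposed cases, 90 unexposed controls
value = odds_ratio(30, 70, 10, 90)
print(round(value, 2), interpret(value))
```

In practice a confidence interval around the odds ratio is also needed before calling an association real; the point estimate alone does not show whether the value differs from 1 by chance.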
53. Risk ratio
Risk ratio is very similar to odds ratio. In risk ratio calculations
the denominator is the total population in each exposure group.
Relative risk = incidence among exposed / incidence among
non-exposed, i.e.
[a/(a+b)] / [c/(c+d)]
Attributable risk: indicates to what extent the disease can be
attributed to the exposure = (incidence among exposed -
incidence among non-exposed) / incidence among exposed
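Relative risk and attributable risk can be sketched from the same 2 by 2 layout, with the unexposed denominator taken as c + d. The function names and counts are hypothetical.

```python
def relative_risk(a, b, c, d):
    """Incidence among exposed divided by incidence among non-exposed."""
    incidence_exposed = a / (a + b)
    incidence_unexposed = c / (c + d)
    return incidence_exposed / incidence_unexposed

def attributable_risk(a, b, c, d):
    """Fraction of the risk in the exposed that is attributable to the exposure."""
    incidence_exposed = a / (a + b)
    incidence_unexposed = c / (c + d)
    return (incidence_exposed - incidence_unexposed) / incidence_exposed

# Hypothetical cohort: 30 of 100 exposed and 10 of 100 non-exposed develop disease
rr = relative_risk(30, 70, 10, 90)
ar = attributable_risk(30, 70, 10, 90)
print(round(rr, 2), round(ar * 100, 1))
```

In this made-up cohort the exposed have three times the incidence, and about two thirds of their disease is attributable to the exposure.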
54. Sample Size Determination
Factors Influencing Sample Size Include:
1) Prevalence of particular event or characteristics- If the
prevalence is high, small sample can be taken and vice versa.
If prevalence is not known, then it can be obtained by a pilot
study.
2) Probability level considered for accuracy of estimate- If we
need more safeguard about conclusions on data, we need a
larger sample. Hence, the size of sample would be larger when
the safeguard is 99% than when it is only 95%.
3) Availability of money, material, and manpower.
4) Time bound study curtails the sample size
55. Sample Size Determination and Variance Estimate
Formula requires the knowledge of standard deviation or
variance.
Frequently used sources for estimation of standard
deviation are:
A pilot or preliminary sample may be drawn from the
population, and the standard deviation computed from this
sample may be used as the estimate.
Observations used in the pilot sample may be counted as a
part of the final sample.
Previous or similar studies
56. Five points are to be considered very carefully.
1. Assess the minimum expected difference between the
groups.
2. Find out the standard deviation of the variables.
3. Set the level of significance (alpha level, generally set
at P < 0.05) and the power of the study (1 - beta = 80%).
4. Select the formula, or a computer program, to obtain
the sample size. Various software packages are available free
of cost for calculation of sample size and power of
study.
5. Lastly, make appropriate allowances for non-compliance
and dropouts; this gives the final
sample size for each group in the study.
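As an illustration of steps 1 to 4, the sketch below uses one commonly quoted formula for comparing the means of two equal groups, n = 2(z₁₋α/₂ + z₁₋β)²·SD²/d². The slides do not state which formula is intended, so this choice, the function name and the numbers are all assumptions.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(sd, difference, alpha=0.05, power=0.80):
    """Per-group n for comparing two means (two-sided test, equal group sizes)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)            # about 0.84 for 80% power
    n = 2 * (z_alpha + z_beta) ** 2 * sd ** 2 / difference ** 2
    return ceil(n)                       # round up to whole subjects

# Hypothetical inputs: SD = 8 mg%, minimum expected difference = 4 mg%
n_per_group = sample_size_per_group(sd=8, difference=4)
print(n_per_group)
```

Halving the expected difference roughly quadruples the required sample size, and raising the power also raises it. The allowance for dropouts in step 5 would be added on top of this figure.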
57. Power of Study
It is a probability that study will reveal a difference
between the groups if the difference actually exists.
The power of a study is very important when calculating the
sample size.
Any study, to be scientifically sound, should have at least
80% power. If the power of a study is less than 80%, the
probability of missing a real difference is high.
If we increase the power of study, then sample size also
increases.
58. STATISTICAL TESTS
parametric tests (for gaussian distribution)
non-parametric tests (for non-Gaussian distribution)
Non-parametric tests are less powerful than parametric
tests. Generally, P values tend to be higher, making it
harder to detect real differences.
A few systematic steps should be followed to establish the
appropriate test for the data.
1. Identify whether the data are Qualitative or Quantitative
2. For Quantitative data, determine the type of distribution
3. Decide how many groups are being compared
4. Determine whether the data is paired or not.
60. Student’s ‘t’ Test
Applied for analysis when the sample size is 30 or
less. If the sample size is more than 30, the ‘Z’ test is applied.
It is usually applicable for graded (quantitative) data like blood
sugar level, body weight, height etc.
When a comparison has to be made between two
measurements in the same subjects, for example before and after a
treatment, the paired ‘t’ test is used. Eg: when we want to
compare the effect of drug A (i.e. decrease in blood sugar) before
the start of treatment (baseline) and after 1 month of treatment
with drug A.
61. When comparison is made between two
measurements in two different groups, unpaired
‘t’ test is used.
For example, when we compare the effects of drugs
A and B (i.e. mean change in blood sugar) after one
month from baseline in both groups, the unpaired ‘t’
test is applicable.
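The two test statistics can be sketched with the standard library. The blood sugar values are hypothetical, the unpaired version assumes equal variances in the two groups (pooled variance), and the resulting t value still has to be compared with a t table at the appropriate degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev, variance

def paired_t(before, after):
    """Paired t statistic: mean difference divided by its standard error (df = n - 1)."""
    diffs = [b - a for b, a in zip(before, after)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

def unpaired_t(group1, group2):
    """Unpaired t statistic with pooled variance (df = n1 + n2 - 2)."""
    n1, n2 = len(group1), len(group2)
    pooled = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / sqrt(pooled * (1 / n1 + 1 / n2))

# Hypothetical blood sugar (mg%) in the same 5 subjects before and after drug A
before = [120, 130, 125, 140, 135]
after = [110, 125, 120, 130, 128]
t_paired = paired_t(before, after)

# Hypothetical falls in blood sugar in two different groups (drug A vs drug B)
fall_a = [10, 5, 5, 10, 7]
fall_b = [4, 6, 3, 5, 2]
t_unpaired = unpaired_t(fall_a, fall_b)
print(round(t_paired, 2), round(t_unpaired, 2))
```

The paired version works on within-subject differences, so between-subject variability does not enter its standard error.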
62. ANOVA
One way ANOVA
It compares three or more unmatched groups when the
data are categorized in one way.
For example, to compare a control group with three
different doses of aspirin in rats. Here, there are four
unmatched groups of rats.
Two way ANOVA
Determines how a response is affected by two factors.
For example, to measure the response to three different drugs
in both men and women.
63. Chi-square test
The Chi-square test is a non-parametric test of
proportions.
Two events can often be studied for their association
such as smoking and cancer, treatment and outcome of
disease, level of cholesterol and coronary heart disease
The test measures the probability (P) that the observed
association is due to chance, and so indicates whether two events are
associated or dependent on each other.
Although the Chi-square test indicates an association between two
events or characters, it does not measure the strength of the
association.
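The Chi-square statistic compares observed counts with the counts expected if the two events were independent. The sketch below computes it for a contingency table; the smoking and cancer counts are hypothetical.

```python
def chi_square(table):
    """Chi-square statistic: sum of (observed - expected)^2 / expected over all cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic

# Hypothetical 2 by 2 table: rows = smokers / non-smokers,
# columns = cancer / no cancer
observed = [[30, 70],
            [10, 90]]
chi2 = chi_square(observed)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
print(chi2, degrees_of_freedom)
```

With 1 degree of freedom the critical value at P = 0.05 is 3.84, so a statistic above that is declared significant. The statistic grows with the sample size, which is why it does not measure the strength of the association.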
65. Types of clinical studies
Retrospective:
Case control
Cross sectional
Prospective:
Cohort
Randomized / non-randomized interventional controlled trials
66. Retrospective studies
Retrospective studies look backward in time and
select study groups based on their exposure to a
risk or protective factor in relation to an outcome
established at the start of the study
Useful in:
rare conditions when a prospective approach
would take too long
significant lag period between exposure and
disease
situations where a prospective investigation may
be unethical
insufficient evidence to justify an interventional
67. Advantages: Retrospective studies are relatively
inexpensive and can utilize existing databases and
registers.
Disadvantages:
recall bias
not possible to randomize the groups – confounding
factors may be present
68. Cross-sectional studies/prevalence study
Examine either a random sample or all of the subjects in
a well-defined study population in order to obtain the
answer to a specific clinical question. They include
surveys and studies which examine the prevalence of a
disease.
Prevalence = [(new + old cases) / total population] x 100
Case–control studies
Patients with a specific disease or condition are selected
and matched to a control group. The cases and controls
are then compared for potential risk factors or causative
agents implicated in the aetiology of the disease.
Disadvantage: prone to various types of bias
69. Observational cohort studies
Cohort studies involve the selection of two or more groups
and their subsequent follow-up over a number of years.
The groups are selected based on the differences in their
exposure to a particular agent and patients are followed
up to see who develops the illness.
Incidence = (number of new cases / total population at
risk) x 1000
Randomized and non-randomized (cohort) interventional
controlled trials
It evaluates an intervention rather than merely observing
two or more groups over time. Systematic bias should be
avoided.
The groups being compared should ideally only be
70. Basic steps in conducting a RCT:
1. Drawing up a protocol
2. Selecting reference and experimental population
3. Randomization
4. Intervention
5. Follow up
6. Assessment of outcome
71. Sensitivity
Ability of the test to correctly identify those
patients with the disease.
A test with 80% sensitivity detects 80% of
patients with the disease (true positives) but 20%
with the disease go undetected (false negatives).
A high sensitivity is clearly important where the
test is used to identify a serious but treatable
disease (e.g. cervical cancer).
72. Specificity
Ability of the test to correctly identify those
patients without the disease.
A test with 80% specificity correctly reports 80%
of patients without the disease as test negative
(true negatives) but 20% patients without the
disease are incorrectly identified as test positive
(false positives).
73. Positive predictive value
The PPV of a test indicates how likely it is that
a patient has the disease given that the test result
is positive.
PPV is higher if the prevalence of the
disease in the population is high.
74. Negative predictive value
The NPV of a test indicates how likely it is that
a patient does not have the disease given that the
test result is negative.
The likelihood ratio of a positive test indicates how much more
likely a positive result is in a patient with the
disease than in a patient without it.
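All of these measures come from the same 2 by 2 table of test result against true disease status. The sketch below uses hypothetical counts, and takes the positive likelihood ratio as sensitivity / (1 − specificity), which is the usual formula but is not stated in the slides.

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Sensitivity, specificity, predictive values and positive likelihood ratio."""
    sensitivity = tp / (tp + fn)     # diseased patients correctly test positive
    specificity = tn / (tn + fp)     # non-diseased patients correctly test negative
    ppv = tp / (tp + fp)             # probability of disease given a positive test
    npv = tn / (tn + fn)             # probability of no disease given a negative test
    lr_positive = sensitivity / (1 - specificity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "lr_positive": lr_positive,
    }

# Hypothetical screening of 300 people, 100 of whom have the disease
result = diagnostic_measures(tp=80, fp=40, fn=20, tn=160)
for name, value in result.items():
    print(name, round(value, 3))

# Same test in a population where the disease is rarer (100 in 2100)
rare = diagnostic_measures(tp=80, fp=400, fn=20, tn=1600)
print(round(rare["ppv"], 3))
```

Sensitivity and specificity are the same in both populations, but the PPV falls sharply when the disease is rarer.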