This document provides an overview of key concepts in biostatistics. It defines biostatistics as the application of statistical methods in the fields of biology, public health, and medicine. Some key points covered include:
- The types of data: qualitative, quantitative, discrete, continuous
- Descriptive statistics like mean, median, and mode
- Inferential statistics like hypothesis testing and estimating parameters
- Important statistical tests like t-tests, ANOVA, and chi-squared tests
- Measures of diagnostic accuracy like sensitivity, specificity, and predictive values
- The process of determining sample size for studies based on factors like confidence interval, power, and allowable error.
Points to be covered…
• Definition of biostatistics
• Data and its types: qualitative/quantitative
• Variable and its types
• Mean/median/mode
• Normal curve
• Sensitivity, specificity and predictive values (Sn/Sp/PV)
• Sample and its types, and calculation of sample size
The science dealing with methods of data collection, compilation, tabulation and analysis, so as to provide meaningful and valid interpretation.
Remember this…
• The quality of clinical and health-planning decisions depends on the quality of the information on which they are based
• Medicine: a science in which chance plays a very significant role
• Statistics helps to quantify the contribution of chance, and helps the individual clinician make valid diagnostic, prognostic or therapeutic decisions
• Statistics helps programme managers and policy planners to plan, monitor and evaluate public health initiatives
Datum: Latin for "fact"
Data: a collection of processed information
Sources of data:
• Primary data: collected and recorded by the investigator(s) themselves by observation, interviews or measuring instruments, usually systematically and for defined purposes
• Secondary data: collected by somebody else or for other purposes, e.g. information derived from hospital data
Variable
• An attribute, quality, characteristic or property of persons or things being studied that can be quantitatively measured or enumerated
• Varies from person to person, or from time to time in the same person
• The choice of statistical test to be used depends on the kind of variable studied
• Examples: height, weight, age, gender, blood pressure, pulse rate, smoking status
Independent (stimulus/explanatory) variable
• A variable that is manipulated or applied by the investigator, or that explains the outcome
• E.g. maternal age, age at marriage, spacing between successive pregnancies, pre-pregnancy weight, weight gain during pregnancy
Dependent (outcome/response) variable
• The resulting response or behaviour that is observed on exposure to the independent variable, e.g.:

Independent variable                    Dependent variable
Maternal age                            Birth weight
Birth spacing                           Birth weight
PIH (pregnancy-induced hypertension)    Perinatal mortality
Qualitative data
• Nominal: the variable has mutually exclusive, unordered categories
  E.g. blood group: A, B, AB, O
  Marital status: unmarried, married, divorced, widowed
• Ordinal: the variable has mutually exclusive, ordered categories
  E.g. disease severity: mild, moderate, severe
Quantitative data
• Discrete: often represents counts
  E.g. number of children; number of times admitted to hospital in the last 5 years
• Continuous: can take any value within a range of values
  E.g. height in cm, weight in kg, distance from home to work in km
Importance of data type
The type of data is critically important in determining which methods of analysis will be appropriate and valid.
Types of Statistics
Descriptive statistics
• Describe the basic features of the data in a study
• Provide summaries about the sample
Inferential statistics
• Investigate questions, models, and hypotheses
• Infer population characteristics based on a sample
• Make judgments about what we observe
Descriptive Statistics
Univariate analysis (one variable at a time):
• Qualitative data: proportions or percentages
• Quantitative data:
  - Measures of central tendency (mean, median, mode)
  - Measures of dispersion (range, standard deviation, coefficient of variation, percentiles, interquartile range)
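A minimal sketch of these univariate summaries using only Python's standard library; the blood-pressure values are invented for illustration:

```python
import statistics as st

# Hypothetical sample: systolic blood pressure (mmHg) of 11 patients
bp = [118, 122, 110, 130, 125, 118, 140, 135, 118, 128, 120]

mean = st.mean(bp)                  # arithmetic mean
median = st.median(bp)              # middle value of the ordered data
mode = st.mode(bp)                  # most frequent value
sd = st.stdev(bp)                   # sample standard deviation
cv = 100 * sd / mean                # coefficient of variation (%)
q1, _, q3 = st.quantiles(bp, n=4)   # quartiles; IQR = Q3 - Q1

print(f"mean={mean:.1f} median={median} mode={mode}")
print(f"SD={sd:.1f} CV={cv:.1f}% IQR={q3 - q1:.1f}")
```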
Inferential Statistics
Two approaches:
1. Estimating parameters: the process of using sample information to draw conclusions about the value of a population parameter, e.g. proportion, mean, SD, correlation
2. Testing hypotheses
Estimating Parameters
1. Point estimates: proportion, mean, SD, correlation
2. Interval estimates (confidence intervals): define an upper limit and a lower limit with an associated probability
• The ends of a confidence interval are called confidence limits
• 95% confidence interval: 95% probability of containing the population mean
• 99% confidence interval: 99% probability of containing the population mean
• A wider (greater) range of values must be included for greater confidence
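As a hedged sketch, a 95% confidence interval for a sample mean using the normal approximation (mean ± 1.96 × standard error); the calcium values are invented:

```python
import math
import statistics as st

# Hypothetical sample: serum calcium (mg/dL) in 10 subjects (values invented)
calcium = [9.2, 9.5, 8.9, 9.8, 9.1, 9.4, 9.6, 9.0, 9.3, 9.7]

mean = st.mean(calcium)
se = st.stdev(calcium) / math.sqrt(len(calcium))   # standard error of the mean

# 95% CI = mean +/- 1.96 * SE; for a 99% CI use 2.58 instead of 1.96
lower, upper = mean - 1.96 * se, mean + 1.96 * se
print(f"mean = {mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```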
Hypothesis Testing
Purpose: to permit generalizations from a sample to the population from which it came.
Steps in hypothesis testing:
1. State the research question in terms of statistical hypotheses
2. Decide on the appropriate test statistic
3. Select the level of significance
4. Determine the value the test statistic must attain to be declared significant
5. Perform the calculations
6. Draw and state conclusions
Step 1: State the research question in terms of statistical hypotheses
• Null hypothesis (H0): a statement that there is no difference or relationship
• Alternative hypothesis (Ha): disagrees with H0
• If the variables are related, H0 is rejected; if unrelated, H0 is retained (not accepted!)
Step 2: Decide on the appropriate test statistic
• Statistics whose primary use is in testing hypotheses are called test statistics
• Tests fall into two families: parametric and non-parametric
Parametric Statistics
• Certain assumptions must be met before a particular test of significance can be applied to a set of data
• Sample measurements must be drawn, in a random manner, from a normally distributed population of measurements
• Parametric tests include Student's t test (paired and unpaired), the F test for analysis of variance, and correlation and regression analyses
Normal distribution
Many naturally occurring events follow a pattern with:
• Many observations clustered around the mean
• Few observations with values far from the mean
This bell-shaped curve was named the normal distribution by the mathematician Gauss.
• Normal distribution: the symmetrical clustering of values around a central location
• Normal curve: the bell-shaped curve that results when a normal distribution is graphed
Normal Distribution
• Developed by Karl F. Gauss (1777-1855), hence 'Gaussian distribution'
• Called 'normal' because many continuous variables in biology and other sciences follow this particular distribution
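A short check of how observations cluster around the mean in a normal distribution, assuming SciPy is available (scipy.stats.norm):

```python
from scipy.stats import norm

# Fraction of a normally distributed population lying within k SDs of the mean
for k in (1, 2, 3):
    frac = norm.cdf(k) - norm.cdf(-k)
    print(f"within ±{k} SD: {frac:.1%}")   # ~68.3%, 95.4%, 99.7%
```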
Skewed (or Asymmetric) Data
When the left and right sides of a frequency distribution are not approximate mirror images, the data are said to be skewed or asymmetrical.
• Negative skew (Curve A): Mean < Median < Mode
• Positive skew (Curve B): Mean > Median > Mode
[Figure: two skewed frequency curves, A with negative skew and B with positive skew]
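A quick illustration (hospital-stay values invented) that in positively skewed data the mean is pulled above the median:

```python
import statistics as st

# Hypothetical right-skewed data: length of hospital stay in days
stay = [1, 1, 2, 2, 2, 3, 3, 4, 5, 7, 10, 21]

print(st.mean(stay))    # ~5.1, pulled up by the long right tail
print(st.median(stay))  # 3.0, so mean > median, as expected with positive skew
```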
Non-Parametric Statistics or Distribution-Free Methods
• A suitable alternative, particularly when the data are in the form of ranks or counts
• Chi-squared test: the most commonly employed non-parametric test
• Others: Wilcoxon rank-sum, Mann-Whitney U or median test, Kruskal-Wallis one-way and Friedman two-way analysis of variance
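A hedged sketch of the chi-squared test of independence on a 2 x 2 table via scipy.stats.chi2_contingency; the counts anticipate the outbreak table shown later:

```python
from scipy.stats import chi2_contingency

# 2 x 2 table: rows = ate raw hamburger (yes/no), columns = (cases, controls)
table = [[17, 7],
         [20, 26]]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-squared = {chi2:.2f}, df = {dof}, p = {p:.3f}")
```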
Step 3: Select the level of significance for the statistical test
• p value: related to the hypothesis test; the probability that the observed result is due to chance alone
• Calculated after the test has been performed
• A small p value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis
• p = 5% is not a rule written in stone; the threshold may be more generous (p = 0.1) or more strict (p = 0.01)
Example
A study of the effects of anticonvulsant therapy on serum calcium concentration in the elderly:
• a group of treated patients
• a group of untreated patients
Outcome variable: serum calcium concentration
Independent variable: anticonvulsant therapy
Null hypothesis: both groups (treated and untreated) have the same mean serum calcium concentration
Test of significance: t test
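A minimal sketch of this comparison with scipy.stats.ttest_ind (unpaired t test); the serum calcium values below are invented, not data from the actual study:

```python
from scipy.stats import ttest_ind

# Hypothetical serum calcium values (mg/dL), invented for illustration
treated   = [8.6, 8.9, 8.4, 9.0, 8.7, 8.5, 8.8, 8.6, 8.9, 8.3]
untreated = [9.3, 9.6, 9.1, 9.5, 9.2, 9.4, 9.7, 9.0, 9.5, 9.3]

t, p = ttest_ind(treated, untreated)   # unpaired (two-sample) t test
print(f"t = {t:.2f}, p = {p:.4f}")
if p <= 0.05:
    print("Reject H0: mean serum calcium differs between the groups")
```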
Case-control study (retrospective)
[Diagram: at the onset of the study, cases and controls are identified; the direction of inquiry then looks backwards in time to classify each group as exposed or unexposed]
Case-control study: Outbreak of Diarrheal Disease at a Resort Club

Ate raw hamburger    Cases     Controls    Total
Yes                  17 (a)    7 (c)       24
No                   20 (b)    26 (d)      46
Total                37        33          70

Cross-product (odds) ratio:
OR = odds(cases) / odds(controls) = (a × d) / (c × b) = (17 × 26) / (7 × 20) = 3.2
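The cross-product ratio as a few lines of Python, using the cell labels above:

```python
# Cell labels as in the table: a, c = exposed (cases, controls);
# b, d = unexposed (cases, controls)
a, c = 17, 7
b, d = 20, 26

odds_cases = a / b              # odds of exposure among cases
odds_controls = c / d           # odds of exposure among controls
odds_ratio = odds_cases / odds_controls   # equivalently (a * d) / (c * b)

print(f"OR = {odds_ratio:.1f}")   # 3.2
```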
ODDS RATIO & RR
• The odds ratio is a ratio of two odds
• The relative risk is a ratio of two probabilities
THREE KEY MEASURES OF VALIDITY
1. SENSITIVITY
2. SPECIFICITY
3. PREDICTIVE VALUE
Outcomes of a Screening/Diagnostic Test

                                True Disease Status
Screening/Diagnostic Test    Positive                Negative                Total
Positive                     True positives (TP)     False positives (FP)    TP+FP
Negative                     False negatives (FN)    True negatives (TN)     FN+TN
Total                        TP+FN                   FP+TN                   TP+FP+FN+TN
What is used as a "gold standard"?
1. The most definitive diagnostic procedure, e.g. microscopic examination of a tissue specimen
2. The best available laboratory test, e.g. polymerase chain reaction (PCR) for HIV
3. A comprehensive clinical evaluation, e.g. clinical assessment of arthritis
Example: calculating sensitivity and specificity

                      True Disease Status
Screening Test        Cases      Non-cases     Total
Results: Positive     140 (a)    1,000 (b)     1,140
Results: Negative     60 (c)     19,000 (d)    19,060
Total                 200        20,000        20,200

Sensitivity = true positives / all cases = 140 / 200 = 70%
Specificity = true negatives / all non-cases = 19,000 / 20,000 = 95%
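The same arithmetic as a short sketch:

```python
# Counts from the table above
tp, fp = 140, 1_000    # a, b
fn, tn = 60, 19_000    # c, d

sensitivity = tp / (tp + fn)   # a / (a + c): cases correctly classified
specificity = tn / (fp + tn)   # d / (b + d): non-cases correctly classified

print(f"sensitivity = {sensitivity:.0%}")   # 70%
print(f"specificity = {specificity:.0%}")   # 95%
```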
Interpreting test results: predictive value
• The probability (proportion) of those tested who are correctly classified
• PPV = cases identified / all positive tests
• NPV = non-cases identified / all negative tests
• Positive predictive value: the probability that subjects with a positive screening test truly have the disease
• Negative predictive value: the probability that subjects with a negative screening test truly do not have the disease
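Continuing the worked example with the same counts, a sketch of both predictive values:

```python
# Same counts as the sensitivity/specificity example
tp, fp = 140, 1_000
fn, tn = 60, 19_000

ppv = tp / (tp + fp)   # a / (a + b): P(disease | positive test)
npv = tn / (fn + tn)   # d / (c + d): P(no disease | negative test)

print(f"PPV = {ppv:.1%}")   # 12.3%
print(f"NPV = {npv:.1%}")   # 99.7%
```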
Why Sampling?
• Study populations are large: it is not possible to reach every member of the population (not practically feasible, and costly)
• Ideally we would study everyone, so that findings can be generalized to the study population
• We cannot study everyone, owing to limited time and resources
• If we cannot study everyone, we sample the population, but in a manner that allows us to generalize the findings
Sample Size Determination
• For any scientific study, the most commonly asked question is: what should the minimum sample size be?
• If the sample is too small, the study fails to detect a true difference
• If the sample is exceedingly large:
  - time and money are wasted
  - even the tiniest relation or difference is reported as significant
• Sample size should therefore be calculated at the planning stage: neither too small nor too large
Sample size calculation
Need to know the following:
• Estimated prevalence (or SD)
• Confidence interval (usually 95%)
• Power: the ability to find significance when two groups are really different (usually 80%)
• Allowable error, or precision (5-10%)
Sample Size for a Qualitative Outcome Variable

n = 4PQ / L²

where:
P = estimated prevalence (percentage)
Q = 100 - P
L = allowable error
Definitions
• P = estimated prevalence (percentage), obtained from a pilot study, published papers or experience
• Q = 100 - P
• L = allowable error
• L, Q and P are in the same unit
L: Allowable Error
Suppose a survey wants to estimate the true prevalence of a disease in a population. The estimate we get from the survey will be within ±L% of the true prevalence.
[Figure: interval from -L to +L around the true value]
Example
A survey is to estimate the prevalence of influenza virus infection in school children.
• The available evidence suggests that approximately 20% (P = 20) of the children will have antibodies to the virus
• The investigator wants to estimate the prevalence within 6% of the true value (6% is the allowable error, L)
Example (continued)
The required sample size is:
n = 4PQ / L² = (4 × 20 × 80) / (6 × 6) = 177.78
Thus approximately 180 children would be needed for the survey.
Note: the population size is not involved in the formula.
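A small helper implementing n = 4PQ/L² (the function name is my own, not from the source):

```python
import math

def sample_size_prevalence(p_percent: float, allowable_error: float) -> int:
    """n = 4PQ / L^2, with P, Q = 100 - P and L all in percent."""
    q = 100 - p_percent
    return math.ceil(4 * p_percent * q / allowable_error ** 2)

print(sample_size_prevalence(20, 6))   # 178, ~180 with the slide's rounding
```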
Sample Size for Estimation of the Mean (Quantitative Outcome Variable)

n = 4S² / L²

where:
S = standard deviation of the parameter
L = allowable error
S and L are in the same unit
The average we find in the survey will be within ±L of the true mean.
Example
Suppose an investigator has some evidence suggesting that the standard deviation of rat weight is about 455 g. He wishes to provide an estimate within 80 g of the true average (80 g is the allowable error, L).
Example (continued)
The required sample size is:
n = 4S² / L² = 4 × (455)² / (80)² = 129.39
Thus approximately 130 rats would be needed.
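The quantitative counterpart, n = 4S²/L², as a sketch (again, the helper name is my own):

```python
import math

def sample_size_mean(sd: float, allowable_error: float) -> int:
    """n = 4 * S^2 / L^2, with S and L in the same unit."""
    return math.ceil(4 * sd ** 2 / allowable_error ** 2)

print(sample_size_mean(455, 80))   # 130 rats
```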
#15 Descriptive Statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample and the measures. Together with simple graphics analysis, they form the basis of virtually every quantitative analysis of data. With descriptive statistics you are simply describing what is, what the data shows.
Inferential Statistics investigate questions, models and hypotheses. In many cases, the conclusions from inferential statistics extend beyond the immediate data alone. For instance, we use inferential statistics to try to infer from the sample data what the population thinks. Or, we use inferential statistics to make judgments of the probability that an observed difference between groups is a dependable one or one that might have happened by chance in this study. Thus, we use inferential statistics to make inferences from our data to more general conditions; we use descriptive statistics simply to describe what's going on in our data.
#17-#18 Researchers focus on probabilities (often called p values) that fall at the lower end of the continuum. The reason for this is partly intuitive and partly historic.
#19-#21 Given an underlying theoretical structure, a representative sample, and an appropriate research design, a researcher can test a hypothesis. We test to see whether the data support the hypothesis. The alternative hypothesis is also called the research hypothesis.
#39 As noted, calculation of sensitivity and specificity, and therefore calculation of predictive value, requires a way to determine authoritatively who has and does not have the condition of interest. This “gold standard” is typically the most definitive diagnostic procedure (for example, the definitive diagnosis of cancer is generally based on microscopic examination of a tissue specimen), the best available laboratory test (for example, a polymerase chain reaction (PCR) test for the actual virus, as opposed to a test for antibody to the virus), or a comprehensive clinical evaluation, where there is no definitive laboratory test. For example, the best diagnosis for arthritis might be obtained through an examination.
#40 Data for estimating sensitivity and specificity are typically displayed in a 2 x 2 table that classifies people according to their disease status and test results. The above table has the True disease status along one dimension, with a column for cases and a column for non-cases, and the Test results on the other dimension, with a row for people who tested positive and a row for people who tested negative. In the top left-hand corner – the “a” cell – are the people who have the disease and whose test came up positive. They are “true positives”, cases who were correctly classified. In the lower right-hand corner – the “d” cell – are the people who do not have the disease and whose test came up negative. They are “true negatives”, non-cases who were correctly classified. The other two cells, b and c, contain people who were misclassified. Non-cases who nevertheless received a positive test are often called “false positives”, and cases who received a negative test are often called “false negatives”, but these terms are not always employed with these meanings.
If cell “c” is in the lower left-hand corner of the table, then the left-hand column – the cases – has a total of (a + c) people, and we can write the formula for sensitivity as a / (a+c): the number of cases correctly classified divided by the total number of cases.
Similarly, the formula for specificity is d / (b+d): the number of correctly classified non-cases divided by the total number of non-cases.
#41 If a population has a total of 200 cases, and the test correctly identifies 140 of them as cases, then a = 140, a+c = 200, and the sensitivity is: a / (a+c) = 140 / 200 = 70%
If there are 20,000 people without the disease, and the test correctly classifies 19,000 of them as non-cases, then d = 19,000, b+d = 20,000, and the specificity is: d / (b+d) = 19,000 / 20,000 = 95%.
As is often the case for a rare disease, even with what seems like a high specificity (95%), the number of false positives can easily exceed the number of true positives. This observation brings us to the concept of predictive value.
#42 Sensitivity and specificity tell us what happens to cases and non-cases, respectively. However, appropriate interpretation of the results of a test – both screening tests and diagnostic tests – makes use of another concept that is very important for both the epidemiologic and the clinical perspectives, predictive value.
Predictive value is also a probability of correct classification, but here the starting point, the denominator for the probability, is the way people have been classified by the test. There are two types of predictive value – predictive value of a positive test and predictive value of a negative test. Predictive value tells us the probability that the test was correct. This is obviously a key question for the clinician (and the patient), since we generally do not know whether someone is a case or not, but we do know whether the person tests positive or negative.
In clinical epidemiology, the prevalence of a disease is referred to as the “prior probability” or “pretest probability”, since it is our initial estimate of the probability that the condition is present. Predictive values are referred to as posterior or posttest probabilities, since they provide estimates of probability that take into account the result of the screening or diagnostic test. The relation of the posttest and pretest probabilities indicates the informativeness of the test.
#44 The table for examining predictive value is the same as that for sensitivity and specificity. Instead of using the total numbers of cases and non-cases, though, predictive value involves the total number of people with a positive test and the total number with a negative test. Positive predictive value, abbreviated PPV or PV+, is the proportion of all people with positive tests who truly have the condition – a / (a+b) in the above table.
Negative predictive value (NPV or PV-) is the proportion of all people with negative tests who truly do not have the condition – d / (c+d) in the above table.
#45 Using the same numbers as in our example for calculating sensitivity and specificity, we find that the predictive value of a positive test (PPV) is only 140 / 1,140 = 12.3%. The predictive value of a negative test (NPV) is 19,000 / 19,060 = 99.7%.
Although the NPV is very high, that is not such an impressive result in this population, since the prevalence of the condition is only 200 / 20,200, which is not quite 1%. That means that if we select a person at random from the population, there is a 1% probability that the person will be a case (the pretest probability). The probability that a person who tests positive actually is a case is 12.3% (the posterior probability), so the test raises the probability substantially. On the other hand, the probability that a person randomly selected from the population does not actually have the condition is already 99%, so the additional information that a person tested negative cannot shift that estimate significantly.
However, the PPV of 12.3% poses a dilemma. Of the 1,140 people who tested positive, the vast majority – 87.7% – are falsely positive. They do not have the disease. Thus, for every person whose disease is detected and who may therefore be helped, 7 people who do not have the disease and will therefore not derive any benefit will undergo a diagnostic workup that may be costly, uncomfortable, and possibly harmful. This tradeoff is the dilemma in population screening for a rare disease.
#46 The above table illustrates the relation among positive predictive value (PPV), sensitivity, specificity, and prevalence of the condition. Note that sensitivity and specificity are being regarded as properties of the test, unaffected – in principle – by the rarity of the condition. In contrast, prevalence is a property of the population in which the test is being used for screening, and PPV shows the result of applying a test with given sensitivity and specificity to a population with a given prevalence.
For sensitivity held constant at 70% and specificity held constant at 95%, PPV is only 1.4% for a disease with a prevalence of 1 in 1,000, but rises to over 40% when the prevalence is 5%. This table illustrates the difference between using a test for screening and for diagnosis. Using the test in the general population, where the disease is rare (say, less than 1%), will result in a positive predictive value below 15% – the large majority of people who test positive will not have the condition. In contrast, people with symptoms are much more likely to have the condition. If the prevalence among them is above 5%, then the proportion of false positive tests is greatly reduced.
The challenge in population screening is to try to target a population at sufficiently high risk that the number of false positives is acceptable and yet a sufficient proportion of the cases are included.
A point often not mentioned in introductory presentations is that while sensitivity and specificity are in principle fixed properties of the test, in practice a test is not a fixed entity. Various factors can affect the sensitivity and specificity of a test when it is actually implemented, since there are often human factors involved in interpreting test results, equipment may require frequent calibration, etc.
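A sketch of the PPV-prevalence relation described above, holding sensitivity at 70% and specificity at 95%:

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value from test properties and disease prevalence."""
    true_pos = sensitivity * prevalence                 # true-positive fraction
    false_pos = (1 - specificity) * (1 - prevalence)    # false-positive fraction
    return true_pos / (true_pos + false_pos)

for prev in (0.001, 0.01, 0.05):
    print(f"prevalence {prev:.1%}: PPV = {ppv(0.70, 0.95, prev):.1%}")
# prevalence 0.1%: PPV = 1.4%
# prevalence 1.0%: PPV = 12.4%
# prevalence 5.0%: PPV = 42.4%
```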
#48 Suggestions to the facilitator: ask participants why and when sampling is required, and explain.