COM 301 INFERENTIAL STATISTICS SLIDES.ppt

AN OVERVIEW OF
INFERENTIAL STATISTICS
BY
EMMANANUEL J.O
DEPARTMENT OF COMMUNITY MEDICINE
PRINCE ABUBAKAR AUDU UNIVERSITY, ANYIGBA
4/8/2024 1

Outline
 What is Inferential Statistics ?
 Epidemiological Study Designs
 Sampling Methods
 Sampling Distributions
 Methods of Inferential Statistics
 Correlation and Regression Analyses
 Exercises
 Conclusion
 Bibliography

Objectives of the Presentation
 To present a concise and straightforward overview of
the basic methods and techniques of medical statistics
 To put the multitude of statistical methods applicable
to medical research into their practical context
 To combine simplicity and depth in doing so
 To improve the statistical rigor of our scientific
publications
 To promote the growth of evidence-based medicine
 Are your expectations captured?

What is Inferential Statistics (I.S.)?
 It concerns decision making about the general
population based on data collected from a sample
(i.e. a subset or part of a population)
 I.S. infers the true finding(s) in the larger
population from the findings in the sample, using
P-values and confidence intervals (CI)
 We infer the parameter from the statistic
 We generalize findings from the sample(s) to the
larger population
 I.S. therefore relies on the statistical properties of
sample estimates

The Process of Making a Statistical Inference
[Flow diagram] Start from the POPULATION (parameter) → Sample (statistic) → P-values and Confidence Intervals → Inference about the population

Validity of Results
• Internal Validity
– Conclusion supported by study designs?
• External validity
– Generalizable to reference population?

Some Epidemiological Study Designs
• Observational
– Descriptive
– Analytic
• Experimental
– RCTs (individual & community)
– Clinical trials

Probability (or Random) Sampling Methods
– The chance of selecting each unit in the population
is known (and often equal)
– The sampling error can be estimated and may be very
small
– Outcomes of studies can be generalized to the larger
population

Examples of Probability (or Random) Sampling Methods
1. Simple Random Sampling
2. Systematic Random Sampling
3. Stratified Random Sampling
4. Cluster Sampling
5. Multi-phase Sampling
6. Multistage Sampling

Non-Probability (or Non-Random) Sampling Methods
• The chance of selecting each unit is not
known and may be unequal
• Outcomes of studies cannot be generalized to the
larger population

Examples of Non-Probability Sampling
Methods
1. Volunteers
2. Convenience
3. Purposive
4. Quota
5. Snowball
6. Haphazard

Exercise
What sampling method (s) would you use in the
following studies?
1. Selection of 100 women attending ANC at the clinic
2. Selection of 150 under-5 children in a nursery
school for a study on malnutrition
3. Selection of 100 men into a clinical trial to test the
effect of their wives’ presence during HCT

Sampling Distributions
• Most events of interest can be described using
probability distributions e.g. the normal or Gaussian
distribution curve
• I.S. therefore uses probability concepts and sampling
theory
• Inferences are drawn by comparing observed data with
the values expected under Ho, using sampling
distributions such as the Z, t, F and Pearson’s
Chi-square distributions

Review: Some sampling distributions
test statistics

Types of probability distributions
 Discrete probability distributions
I. Binomial distribution (for dichotomous
outcomes where the events of interest are
independent)
II. Poisson distribution (for counts of rare events e.g.
plane crashes)
 Continuous probability distributions
I. Normal distribution (for quantitative
continuous variables)
 Note: survival data are usually analysed with the Cox
proportional hazards model, which is a regression
model rather than a probability distribution

Review: The Normal Distribution Curve
• The most widely used probability distribution
• Many significance tests assume that the data
collected follow this distribution
• Sample estimates such as means and proportions are
approximately normally distributed when the sample is
large enough, irrespective of the nature of the variable
(qualitative or quantitative) (= Central limit theorem);
skewed data may also be transformed to approximate
normality
• The normal distribution therefore plays a major role in
statistical inference

The Normal Distribution Curve
• About 68%, 95% and 99.7% of values lie within ±1, ±2 and ±3 SD
of the mean respectively
• [Figure: normal curve; horizontal axis marked at µ−3σ, µ−2σ, µ−σ, µ, µ+σ, µ+2σ, µ+3σ]
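The coverage figures quoted for the normal curve can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name coverage_within is an illustrative choice, not something from the slides. It shows that the ±2 SD figure is closer to 95.4% and the ±3 SD figure closer to 99.7%, while exactly 95% corresponds to ±1.96 SD.

```python
from statistics import NormalDist


def coverage_within(k_sd: float) -> float:
    """Proportion of a normal distribution within +/- k_sd SDs of the mean."""
    standard_normal = NormalDist(mu=0.0, sigma=1.0)
    return standard_normal.cdf(k_sd) - standard_normal.cdf(-k_sd)


# Print the coverage for 1, 2 and 3 standard deviations, plus 1.96.
for k in (1, 2, 3, 1.96):
    print(f"within +/- {k} SD: {coverage_within(k):.1%}")
```

The value 1.96 is the same Z value used later in the deck for 95% confidence intervals and sample size formulas.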

Skewed to the Right (Positive
Skewness)

Skewed to the Left (Negative
Skewness)

Methods of Inferential Statistics
1. Hypothesis testing (Ho) or Significance
Testing (gives P-values)
2. Estimation of magnitude of effect
a) Point estimation e.g. a mean difference, OR or RR
b) Interval estimation e.g. 95% CI
Caution!
I. Biological Plausibility
II. Confounding

Steps involved in Hypothesis Testing/
Significance Testing
1. State the NULL Hypothesis (Ho)
2. State the ALTERNATIVE Hypothesis (Ha)
3. Set the ALPHA (α) level
4. Select and perform the appropriate statistical test
e.g. Student t-test, Paired t-test or Chi-square etc
5. Calculate the P-Value from the test statistic
6. Decide statistical significance ( Result due to chance
or not)
7. Conclude (Clinical Significance )

General format for many test statistics
• Test statistic = [Observed value (O) minus Expected
value (E, i.e. the value under Ho)] divided by the Standard Error (SE)
• Test statistic = (O – E)/SE; the P-value is then read from the
sampling distribution of the test statistic
• SE of the sample mean = sample SD/square root of
n (where n = number of observations in the sample)
• Used for the 1-sample Z test, 1-sample t-test, 2-sample t-test
and paired t-test; Pearson’s Chi-square test instead sums
(O – E)²/E over the cells of the table
• The P-value may be calculated manually or by using
statistical software (e.g. SPSS, STATA, EPI-INFO)
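As an illustration of the (O – E)/SE format, here is a minimal Python sketch of a one-sample Z test using only the standard library. The function name and the numbers in the example (sample mean 125, null mean 120, SD 15, n = 36) are hypothetical, and the sketch assumes the sample is large enough for the normal approximation to hold.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, null_mean, sd, n):
    """Return (z, two_sided_p) for a one-sample Z test.

    z = (observed - expected under Ho) / standard error of the mean.
    """
    standard_error = sd / sqrt(n)
    z = (sample_mean - null_mean) / standard_error
    # Two-sided P-value: probability of a |z| at least this large if Ho is true.
    two_sided_p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, two_sided_p


# Hypothetical data: mean 125 observed in 36 people, Ho mean 120, SD 15.
z, p = one_sample_z_test(sample_mean=125.0, null_mean=120.0, sd=15.0, n=36)
print(f"z = {z:.2f}, two-sided P = {p:.4f}")
```

With a small sample and an SD estimated from the data, the t distribution would normally replace the normal distribution in the last step.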

P-Values
• The P-value is the probability of getting a difference at least as
big as that observed if the NULL hypothesis (Ho) is TRUE
• The smaller the P-value (e.g. < 0.05), the less compatible the
observed difference is with Ho, i.e. the stronger the evidence
against Ho
• By convention, 2-sided (2-tailed) P-values are used
• A guide to tell us that a result is “significant”
• Generally judged at the 5% significance level (corresponding to
a 95% CI), rarely at the 1% level (99% CI)

P-Values 2
• P < 0.05 (significant at the 5% level) means that, if Ho were
true, a difference at least as big as the one observed would
arise by chance in fewer than 5% of samples. It does NOT mean
that there is a 95% probability that the result is true
• Example: P-value < 0.01 (significant at the 1% level, i.e. 99% CI)
• Example: P-value = 0.36 (not significant at either the 5% or the 1% level)

Common Mistakes in the
Interpretation of P-values
• Do not ignore all P-values > 0.05, especially in studies
with small sample sizes, because statistically non-significant
differences are NOT always clinically or
medically non-significant. Check the CI range as well.
• On average, 1 in 20 comparisons in which Ho is true
will give P < 0.05 by chance alone (a false positive), so be
cautious when many comparisons are made
• A large enough sample detects even an extremely small
difference in a population, so do not hurriedly treat a
statistically significant difference as clinically important
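The "1 in 20" point can be made concrete with a short calculation. The sketch below assumes the comparisons are independent and that Ho is true in every one of them; the function name is illustrative. Under those assumptions the chance of at least one false positive among k comparisons at level α is 1 – (1 – α)^k.

```python
def prob_at_least_one_false_positive(k: int, alpha: float = 0.05) -> float:
    """Chance of one or more P < alpha among k independent tests when Ho is true in all."""
    return 1 - (1 - alpha) ** k


# How quickly the false-positive risk grows with the number of comparisons.
for k in (1, 5, 10, 20):
    risk = prob_at_least_one_false_positive(k)
    print(f"{k:>2} comparisons: {risk:.1%} chance of at least one false positive")
```

With 20 such comparisons the expected number of false positives is 20 × 0.05 = 1, which is where the "1 in 20" rule of thumb comes from.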

Confidence Intervals (CI)
• A CI is a range of plausible values for the true value of
the parameter being estimated
• The parameter could be a mean, mean difference,
odds ratio, difference in proportions etc
• A 95% CI is constructed so that, over repeated samples,
about 95% of such intervals contain the true value of
the parameter; informally, we are about 95% confident
that the true value lies within the interval
• A 99% CI is wider and gives about 99% confidence
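A minimal Python sketch of how a confidence interval for a mean is built from the estimate and its standard error is given below. It uses the normal approximation (suitable for large samples); the function name and the example numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sd, n, confidence=0.95):
    """Return (lower, upper) for the mean using the normal approximation."""
    # Z multiplier, e.g. about 1.96 for 95% and about 2.58 for 99%.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = sd / sqrt(n)
    return sample_mean - z * standard_error, sample_mean + z * standard_error


# Hypothetical sample: mean 125, SD 15, n = 36.
for level in (0.95, 0.99):
    lower, upper = mean_confidence_interval(125.0, 15.0, 36, confidence=level)
    print(f"{level:.0%} CI: {lower:.2f} to {upper:.2f}")
```

The 99% interval comes out wider than the 95% interval for the same data, which is the price of the higher confidence level.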

Confidence Intervals (CI)
• CIs are used with risk ratios or relative risks (RR) and
odds ratios (OR)
• The width of a CI tells us about the precision of our
estimate; it does not correct for bias
• With an OR or RR we can estimate the magnitude of
the association between variables
• E.g. a 95% CI tells us that we can be 95% ‘confident’
that the true association is somewhere in
that interval
• Example: OR = 7, 95% CI = (5.2 – 8.8) or (5.2, 8.8): narrow, precise
• Example: OR = 7, 95% CI = (0.4 – 18.7) or (0.4, 18.7): wide, imprecise

Interpretation of Confidence Intervals
(CI)
• CIs agree with the corresponding P-values
• If the CI includes the null value of the parameter (ZERO for a
difference, ONE for a ratio such as OR or RR), the result is not
significant, i.e. the P-value is > 0.05 (and vice versa)
• This is because a Z value of 1.96 (95% CI) corresponds to a
2-sided P-value of 0.05
• So if P < 0.05, the 95% CI will not contain the null value
• The size of the P-value (and the width of the CI) also depends
on the SAMPLE SIZE

Interpretation of Confidence Intervals
(CI)
• CI for difference in means
-3.5 to 8.9 (not significant) = P-value > 0.05 (or 0.01)
5.8 to 11.5 (significant) = P-value < 0.05 (or 0.01)

Exercise: Interpretation of Confidence
Intervals (CI)
• CI for correlation coefficient
– −0.3 to 0.6 (significant?)
– 0.5 to 0.72 (significant?)
• CI for odds ratios
– 0.12 to 3.67 (significant?)
– 3.67 to 5.89 (significant?)

Reasons for observed difference/
association
1. Chance (assessed by hypothesis or significance
testing)
2. Confounding e.g. smoking, lung cancer & asbestosis
3. Interaction (Effect modification)
4. Spurious factors (Bias) e.g. selection & information
bias

Use of 2-By-2 Tables to Calculate OR
and RR
             SICK    WELL    TOTAL
Exposed      a       b       a+b
Unexposed    c       d       c+d
Total        a+c     b+d     N

Use of 2-By-2 Tables Cont’d
• Odds Ratio (OR) = ad/bc
• Relative Risk (RR) = [a/(a+b)] / [c/(c+d)] = a(c+d) / [c(a+b)]

ODDS RATIO
OR = (12 × 17)/(2 × 5)
= 204/10
= 20.4
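The 2-by-2 table calculations can be sketched in Python as follows. The cell layout follows the table in this deck (a, b = exposed sick/well; c, d = unexposed sick/well). The example uses a = 12 and d = 17 from the worked odds ratio; which of the values 2 and 5 is b and which is c is an assumption here, since only their product appears in the calculation. The confidence interval uses the common log-based (Woolf) approximation, which is not stated in the slides.

```python
from math import exp, log, sqrt


def odds_ratio(a, b, c, d):
    """Cross-product ratio ad/bc from a 2-by-2 table."""
    return (a * d) / (b * c)


def relative_risk(a, b, c, d):
    """Risk in the exposed divided by risk in the unexposed."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Approximate 95% CI for the OR, computed on the log scale (Woolf method)."""
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_or = log(odds_ratio(a, b, c, d))
    return exp(log_or - z * se_log_or), exp(log_or + z * se_log_or)


# Assumed layout: exposed 12 sick / 5 well, unexposed 2 sick / 17 well.
a, b, c, d = 12, 5, 2, 17
lower, upper = odds_ratio_ci(a, b, c, d)
print(f"OR = {odds_ratio(a, b, c, d):.1f}, approx. 95% CI {lower:.1f} to {upper:.1f}")
print(f"RR = {relative_risk(a, b, c, d):.2f}")
```

With counts this small the interval is very wide, which illustrates why the CI should be reported alongside the OR.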

Interpreting odds ratios and
confidence intervals
• Odds ratios measure association between 2
qualitative or categorical variables
• OR values range from zero to infinity
• OR > 1 when the association is positive (Risk factor?)
• OR < 1 (between 0 and 1) when the association is negative
(Protective factor?)
• OR = 1 when there is no association i.e. odds in the 2
groups are the same

Interpreting OR and CI
• The OR is always further away from 1 than the
corresponding RR (risk ratio or prevalence ratio):
if RR > 1 then OR > RR; if RR < 1 then OR < RR
• For rare outcomes the odds are approximately equal to
the risks (OR ≈ RR)
• The OR for the occurrence of disease is the reciprocal
of the OR for non-occurrence of the disease
• ORs (cross-product ratios) are fundamental in the analysis of
Case-Control studies

Interpreting CIs for odds ratios
• CIs for ORs are significant (P < 0.05) when the
interval does not include 1
– Examples 0.23 – 0.56, 2.67 – 5.78, 11.21 – 23.56
• They are NOT significant (P > 0.05) when the
interval includes 1
– Examples 0.24 – 4.78, 0.02 – 2.56 etc

Some Determinants of Sample Size
• The study design e.g. is it a cross-sectional study?
• The size of difference the study is designed to detect
between groups e.g. 10% or 15%? The smaller the
difference, the larger the sample size & vice versa
• Statistical power to detect an actual difference (power = 1 – β,
where β is the probability of a type 2 error; commonly 80% or 90%)
• The level of error (alpha) the researcher is willing to
tolerate (type 1 error), usually 5% (95% CI)
• Drop-out/attrition/non-response rate

Sample Size Calculation for a Cross-
Sectional Study
Leslie Kish formula
n = Zα² p q / d²
Where n = minimum sample size
Zα = standard normal deviate at the 95% confidence level = 1.96
p = previous estimate of the proportion of interest = say 45.1%
(0.451) i.e. from literature or pilot study, or use 50%
q = 1 – p = 1 – 0.451 = 0.549
d = degree of precision = 5% (0.05)

Sample Size Calculation for a Cross-
Sectional Study 2
• Evaluating the formula
• n = (1.96)² × 0.451 × 0.549 / (0.05)²
• = 380.5, i.e. approximately 380
• Minimum sample size = 380
• Allow for a 10% non-response rate = 380 × 100/90 = 422.2
• Therefore n = 422
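The cross-sectional sample size calculation can be reproduced with a few lines of Python. This is a minimal sketch of the formula as given in this deck; the function names are illustrative. It returns unrounded values so that the rounding convention (nearest whole number or always upward) stays an explicit choice of the researcher.

```python
from statistics import NormalDist


def kish_sample_size(p, d, confidence=0.95):
    """Unrounded minimum sample size n = Z^2 * p * (1 - p) / d^2."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    return z ** 2 * p * (1 - p) / d ** 2


def allow_for_non_response(n, non_response_rate):
    """Inflate n so that the target is still met after the expected non-response."""
    return n / (1 - non_response_rate)


n = kish_sample_size(p=0.451, d=0.05)
print(f"unrounded minimum sample size: {n:.1f}")
print(f"after 10% non-response: {allow_for_non_response(n, 0.10):.1f}")
```

Using p = 0.50 when no prior estimate exists gives the largest possible n for a given precision, which is why 50% is the usual fallback.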

Sample size formula to compare two
independent proportions
Using the formula for calculating sample size for the
comparison of two independent proportions:
n per group = 2 (Zα + Zβ)² π (1 – π) / d²
Where,
n = minimum sample size per group
Zα = standard normal deviate corresponding to a 2-sided
probability of a type 1 error (α) of 5% = 1.96
Zβ = standard normal deviate corresponding to 90% statistical power,
i.e. a probability of a type 2 error (β) of 10% = 1.28

Sample size formula to compare two
independent proportions (cont’d)
π = mean of the two proportions P1 and P2
P1 = proportion of patients with the outcome of interest in group 1
P2 = proportion of patients with the outcome of interest in group 2
d = the desired level of difference between the two groups P1 & P2
• Assuming the prevalence of the outcome of interest is 24% (from
literature or your pilot study), then 24% will be used in this study to
detect a difference of say 15% between the two groups

Sample size formula to compare two
independent proportions (cont’d) 2
Therefore,
• P1 = 24% = 0.24
• P2 = 24% + 15% = 39% = 0.39 (= P1 + d)
• π = (24 + 39)/2 = 63/2 = 31.5% = 0.315
• 1 – π = 1 – 0.315 = 0.685
• n = 2 (1.96 + 1.28)² × 0.315 × 0.685 / (0.15)²
• n = 20.995 × 0.315 × 0.685 / 0.0225
• n = 201.3, rounded up to 202 = minimum sample size for each group
• Assuming a 10% attrition rate = 202 × 100/90 = 224.4, rounded up to 225 per group
• Total sample size for the two groups = 450 participants
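The two-proportion formula can also be evaluated in Python without rounding any intermediate value. This is a minimal sketch of the formula as given in this deck; the function name and its default arguments (5% two-sided alpha, 90% power) are illustrative choices. Small differences from a hand calculation can arise from rounding Zα, Zβ or 1 – π along the way.

```python
from statistics import NormalDist


def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.90):
    """Unrounded n per group = 2 (Z_alpha + Z_beta)^2 * pi * (1 - pi) / d^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 1.28 for 90% power
    pi = (p1 + p2) / 2                             # mean of the two proportions
    d = abs(p2 - p1)                               # difference to be detected
    return 2 * (z_alpha + z_beta) ** 2 * pi * (1 - pi) / d ** 2


n_per_group = sample_size_two_proportions(p1=0.24, p2=0.39)
print(f"unrounded n per group: {n_per_group:.1f}")
print(f"per group after 10% attrition: {n_per_group / 0.90:.1f}")
```

Halving the detectable difference d roughly quadruples the required sample size, because d enters the formula squared.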

Sample Size for RCTs
N = [1/(1 – f)] × [2 (Zα + Zβ)² × P (1 – P)] / (P0 – P1)²
Where P = (P0 + P1)/2
SAMPLE SIZE FOR OTHER STUDY DESIGNS?

Bivariate/Multivariate Analyses
• Bivariate analyses: used to find the relationship
between 2 variables, or the difference between groups
concerning a characteristic:
– Apply the Chi-square or t-test etc as appropriate
– Use P-values and confidence intervals for estimates
• Multivariate logistic regression is the most widely
used method when more than 2 variables are involved

Practical Considerations for Logistic
Regression
• Sample size
• Selection of best variable type as predictor
variable
• Prevalence of the outcome or dependent
variable etc

Logistic Regression Analyses
• Popular in medical research because many outcomes
are in qualitative units e.g. disease status, outcome
of illness etc
• Outcome variables are qualitative dichotomous or
multichotomous
• It is necessary to adjust for confounders (to develop
predictor models)
• The independent (or predictor) variables could be
quantitative or qualitative

Example of a result of a logistic regression analysis of
contraceptive use on women’s characteristics

Interpretation of results in the Table
• Age and location are significant
• Women aged less than 25 years have 4.76 times the odds
of using contraceptives compared with those aged 35 years
and above, and this was a significant result (OR = 4.76, 95%
CI = 2.45 – 8.23, P < 0.001)
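Statistical software usually reports a logistic regression coefficient and its standard error on the log-odds scale; the odds ratio and its CI are obtained by exponentiating. The sketch below shows that conversion in Python. The coefficient (1.56) and standard error (0.30) are hypothetical values chosen for illustration and are not taken from the table in this deck.

```python
from math import exp
from statistics import NormalDist


def odds_ratio_from_coefficient(beta, se, confidence=0.95):
    """Return (OR, lower, upper) from a log-odds coefficient and its standard error."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return exp(beta), exp(beta - z * se), exp(beta + z * se)


# Hypothetical coefficient and standard error for one predictor category.
or_value, lower, upper = odds_ratio_from_coefficient(beta=1.56, se=0.30)
print(f"OR = {or_value:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

Because the interval is symmetric on the log scale, it is not symmetric around the OR itself; an interval that excludes 1 corresponds to P < 0.05.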

Exercises
• What type of analyses/ test statistic would you
use?
• HIV status compared among four groups of 500
women each: those married, never married,
divorced, separated
• Nutritional status of children compared between
three socioeconomic classes
• To identify predictors of suicide attempt – age,
gender, educational status, associated medical
illness

Exercise
• Predicting the HIV status ( dependent
variable) of commercial sex workers using age
of sex worker, base (brothel or non brothel),
years in sex work, number of sexual partners,
condom use with partners, history of STI and
exposure to HIV AIDS intervention
(Independent or predictor variables)

• A z-test is a statistical test used to determine whether
two population means are different when the
variances are known and the sample size is
large.
• A t-test is a statistical test used to compare
the means of two groups when the standard
deviation or variance of the population is unknown
and has to be estimated from the sample.

• A chi-square test is a statistical test used to
compare observed results with expected results.
• ANOVA, which stands for Analysis of Variance, is a
statistical test used to analyze the difference
between the means of more than two groups.
• The Student's t test is used to compare the
means between two groups, whereas ANOVA is
used to compare the means among three or
more groups.
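The comparison of observed with expected results in a chi-square test can be sketched in Python as follows. Expected counts are computed from the row and column totals under the assumption of no association. The function name and the example table are hypothetical, and no continuity correction is applied.

```python
def chi_square_statistic(table):
    """Return (chi2, df) for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under Ho (no association between rows and columns).
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Hypothetical 2-by-2 table: rows = exposed/unexposed, columns = sick/well.
chi2, df = chi_square_statistic([[12, 5], [2, 17]])
print(f"chi-square = {chi2:.2f} on {df} degree(s) of freedom")
```

The same function handles larger tables, such as an outcome compared across three or four groups, where the degrees of freedom are (rows – 1) × (columns – 1).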

• The Paired Samples t Test compares the
means of two measurements taken from
the same individual, object, or related
units.
• A paired t-test takes paired observations
(like before and after), subtracts one from
the other, and conducts a 1-sample t-test
on the differences.
• Paired-samples t tests compare scores on
two different variables but for the same
group of cases; independent-samples t
tests compare scores on the same variable
but for two different groups of cases.
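The description of the paired t-test as a one-sample t-test on the differences can be sketched in Python. The before/after values are hypothetical. Only the t statistic and degrees of freedom are computed here; the P-value would be read from the t distribution using tables or statistical software.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(before, after):
    """Return (t, df) for a paired t-test on two equal-length lists of measurements."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    # Step 1: reduce each pair to a single difference.
    differences = [b - a for b, a in zip(before, after)]
    n = len(differences)
    # Step 2: one-sample t-test of the mean difference against zero.
    standard_error = stdev(differences) / sqrt(n)
    return mean(differences) / standard_error, n - 1


# Hypothetical systolic blood pressure before and after an intervention.
before = [140, 135, 150, 145, 160]
after = [132, 130, 148, 138, 151]
t, df = paired_t_statistic(before, after)
print(f"t = {t:.2f} on {df} degrees of freedom")
```

An independent-samples t-test on the same numbers would ignore the pairing and generally give a smaller t statistic, because between-person variation would no longer cancel out.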

• Wilcoxon rank-sum test is used to compare
two independent samples, while Wilcoxon
signed-rank test is used to compare two
related samples, matched samples, or to
conduct a paired difference test of repeated
measurements on a single sample to assess
whether their population mean ranks differ.

Regression analysis
Named according to the outcome variable, thus:
• Qualitative outcomes – Logistic regression
(bivariate or multivariate)
• Count outcomes (e.g. number of events) – Poisson regression
• Numeric outcome (quantitative, continuous, normally
distributed) – Multiple linear regression
• Time to an event as outcome/Survival analysis – Cox
regression

Interpreting results from multivariate
analyses
• Multivariable methods and their estimates are reported
as:
– Multivariate logistic regression (odds ratios and
CI)
– Poisson regression (rate ratios and CI)
– Multiple linear regression (regression coefficients)
– Cox regression (hazard ratios)

Conclusion
• Summary
• What did you hear ?
• Any take home ?
• Were your expectations met ???

Bibliography
• Essential Medical Statistics, B.R. Kirkwood, J.A.C.
Sterne. Blackwell Science, 3rd edit. 2021.
• Fundamentals of Statistics, S.C. Gupta, Himalaya
Publishing House. 7th edit. 2019.

THANK YOU FOR LISTENING
