Lecture-3 inferential stastistics.ppt

Haileab.F(BSc, MPH)
University of Gondar
College of medicine and health science
Department of Epidemiology and Biostatistics
Statistical Inference

Objectives
 After completing this session you will be able to:
 Understand the basics of statistical inference
 Apply statistical inference to real data sets

Introduction # 1
 Statistical inference is the process of generalizing or
drawing conclusions about the target
population on the basis of results obtained
from a sample.

Introduction #2
 Statistical inference can be either parametric or
non-parametric
 Example: the pathway for the analysis of
continuous variables [figure not reproduced]

Introduction #3
 We have two facts that are key to statistical
inference.
• Population parameters are fixed numbers whose values
are usually unknown
• Sample statistics are known values for any given
sample, but vary from sample to sample taken from the
same population.
• This variability of sample statistics (sampling
variation) is always present and must be
accounted for in any inferential procedure by
identifying probability distributions that describe
the variability of sample statistics.

Introduction #4
 If we take all possible samples of a given size and compute the
statistic for each, the frequency distribution of these values forms the
sampling distribution of the sample statistic

Introduction #5
 This sampling distribution has characteristics that can be related to
those of the population from which the sample is drawn.
 This relationship is usually provided by the parameters of the probability
distribution describing the population.
 E.g. Sampling Distribution of the means and proportions

Illustrative examples of Distribution of Sample
Mean
 Consider an experiment consisting of drawing two disks
from five, replacing the first before drawing the second,
and then computing the mean of the values on the two
disks.
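The experiment can be sketched in a few lines of Python. The slides do not state the values on the disks, so the values 1 to 5 below are an assumption made only for illustration. The sketch lists every equally likely ordered sample of size two, tabulates the sampling distribution of the mean, and compares its mean and SD with the population values.

```python
import math
from collections import Counter
from fractions import Fraction
from itertools import product
from statistics import mean, pstdev

# Assumed disk values (not given in the slides)
disks = [1, 2, 3, 4, 5]
n = 2  # two draws, with replacement

# All 25 equally likely ordered samples and their means
samples = list(product(disks, repeat=n))
sample_means = [Fraction(sum(s), n) for s in samples]

# Sampling distribution of the mean: value -> number of samples
sampling_distribution = Counter(sample_means)
for value in sorted(sampling_distribution):
    print(float(value), f"{sampling_distribution[value]}/{len(samples)}")

pop_mean = mean(disks)          # population mean
pop_sd = pstdev(disks)          # population SD
mean_of_means = float(mean(sample_means))
sd_of_means = pstdev([float(m) for m in sample_means])

print("population mean:", pop_mean, "mean of sample means:", mean_of_means)
print("sigma/sqrt(n):", pop_sd / math.sqrt(n), "SD of sample means:", sd_of_means)
```

Under this assumption the mean of the sample means equals the population mean, and the SD of the sample means equals the population SD divided by √n.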

Properties of the sampling distribution of means
1. The mean of the sampling distribution of
means is the same as the population mean, $\mu$.
2. The SD of the sampling distribution of means
is $\sigma/\sqrt{n}$ (the standard error).
3. The shape of the sampling distribution of
means is approximately a normal curve,
regardless of the shape of the population
distribution, provided n is large enough.
Haileab.f (MPH) 3/1/2023

Assignment
 Other Sampling Distributions
 E.g. t-distribution, chi-square distribution, F
distribution, etc.
 Relationships among the distributions

Principles of Inference
 As we have repeatedly noted, one of the primary
objectives of a statistical analysis is to use data
from a sample to make inferences about the
population from which the sample was drawn.
 A statistical inference is composed of two parts:
1. a statement about the value of a population parameter, and
2. a measure of the reliability of that statement,
usually expressed as a probability.
 Traditionally, statistical inference is done with
one of two different but related objectives in
mind: testing hypotheses or estimating parameters.

Assumptions
 Two major assumptions are needed to assure
the correctness for statistical inferences:
• randomness of the sample observations, and
• the distribution of the variable(s) being
studied.

Principles of Inference …...
 Tests of hypotheses, in which we hypothesize that one or
more parameters have some specific values or
relationships, and make our decision about the
parameter(s) based on one or more sample statistics. In
this type of inference, the reliability of the decision is the
probability that the decision is incorrect.
 Estimation of one or more parameters using sample
statistics. This estimation is usually done in the form of an
interval, and the reliability of this inference is expressed
as the level of confidence we have in the interval.

Estimation
 Two methods of estimation are commonly used: point
estimation and interval estimation
1. Point estimation: - A single numerical value
used to estimate the corresponding population
parameter

2. Interval estimation: a range (an interval) of
values used to estimate the true value of a
population parameter, with a specified degree of
confidence.
Confidence interval (CI) estimate of a parameter:
CI = Estimator ± (reliability coefficient) × (standard error)

Confidence level: the probability 1 − α, that is, the proportion
of times that the confidence interval actually contains
the population parameter, assuming that the estimation
process is repeated a large number of times.
 Often written (1 − α) = 0.95
Definition/Interpretation : 95% CI
1. Probabilistic interpretation:
 If all possible random samples (an infinite number) of a
given sample size (e.g. 10 or 100) were obtained and if
each were used to obtain its own CI, then 95% of all such
CIs would contain the unknown population parameter; the
remaining 5% would not.
 It is incorrect to say “There is a 95% probability that the CI
contains the unknown population parameter”.

2. Practical interpretation
 When sampling is from a normally distributed
population with known standard deviation, we
are 100(1 − α)% [e.g., 95%] confident that the
single computed interval contains the unknown
population parameter.

Confidence intervals…
 The 95% confidence interval is calculated in such a way
that, under the conditions assumed for underlying
distribution, the interval will contain true population
parameter 95% of the time.
 Loosely speaking, you might interpret a 95% confidence
interval as one which you are 95% confident contains the
true parameter.
 90% CI is narrower than 95% CI since we are only 90%
certain that the interval includes the population parameter.
 On the other hand 99% CI will be wider than 95% CI; the
extra width meaning that we can be more certain that the
interval will contain the population parameter. But to obtain a
higher confidence from the same sample, we must be
willing to accept a larger margin of error (a wider interval).

 As the confidence level increases, the interval gets wider
 99% CI is wider than 95% CI
 95% CI is wider than 90% CI
 The larger the sample size, the narrower the CI
 The more precise our estimate

 For a given confidence level (i.e. 90%, 95%, 99%)
the width of the confidence interval depends on
the standard error of the estimate which in turn
depends on the
 1. Sample size:-The larger the sample size, the
narrower the confidence interval (this is to mean the
sample statistic will approach the population parameter)
and the more precise our estimate. Lack of precision
means that in repeated sampling the values of the
sample statistic are spread out or scattered. The result
of sampling is not repeatable.

- To increase precision (of an SRS), use a larger
sample. You can make the precision as high as
you want by taking a large enough sample. The
margin of error decreases as √n increases.
 2. Standard deviation: The more variation
among the individual values, the wider the
confidence interval and the less precise the
estimate. The SD itself does not shrink as the sample
size increases, but the standard error (SD/√n) does.
 z is the value from the standard normal distribution (SND)
 90% CI, z = 1.645
 95% CI, z = 1.96

 More variation: wider CI
 Less precise
 Increase sample size
 Decrease SE

Estimation for Single Population

Margin of Error
(Precision of the estimate)

Example:
1. Waiting times (in hours) at a particular
hospital are believed to be approximately
normally distributed with a variance of 2.25
hr².
a. A sample of 20 outpatients revealed a mean
waiting time of 1.52 hours. Construct the 95%
CI for the estimate of the population mean.
b. Suppose that the mean of 1.52 hours had
resulted from a sample of 32 patients. Find the
95% CI.
c. What effect does a larger sample size have on
the CI?

a. $\bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} = 1.52 \pm 1.96\sqrt{\frac{2.25}{20}} = 1.52 \pm 1.96(0.33) = 1.52 \pm 0.65 = (0.87,\ 2.17)$
• $\sigma$ = standard deviation = square root of the variance
•We are 95% confident that the true mean waiting time is between
0.87 and 2.17 hrs.
• Although the true mean may or may not be in this interval, 95%
of the intervals formed in this manner will contain the true mean.
• An incorrect interpretation is that there is 95% probability that
this
interval contains the true population mean.
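As a rough check of part (a), and a sketch of parts (b) and (c), the interval can be computed with Python's standard library. The helper name `z_interval` is hypothetical and introduced only for this illustration; it uses unrounded values, so the limits differ slightly in the second decimal from the hand calculation above.

```python
import math
from statistics import NormalDist

def z_interval(sample_mean, variance, n, confidence=0.95):
    """CI for a mean when the population variance is known (illustrative helper)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # reliability coefficient
    se = math.sqrt(variance / n)                        # standard error
    margin = z * se                                     # margin of error
    return sample_mean - margin, sample_mean + margin

ci_a = z_interval(1.52, 2.25, 20)   # part a: n = 20
ci_b = z_interval(1.52, 2.25, 32)   # part b: n = 32

print("n = 20:", tuple(round(v, 2) for v in ci_a))
print("n = 32:", tuple(round(v, 2) for v in ci_b))
```

The interval for n = 32 is narrower than the one for n = 20, which is the effect asked about in part (c): a larger sample gives a smaller standard error and a more precise estimate.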

B. Unknown variance
(small sample size, n ≤ 30)
 What if σ for the underlying population is
unknown and the sample size is small?
 As an alternative we use Student’s t
distribution.

Example
t-value at 90% CL at 19 df =1.729

 CI = x̄ ± (confidence coefficient) × SE
 Confidence coefficient = tabulated t value for the given confidence level
 SE (standard error) = √(variance/n), or SD/√n

2. CIs for a single population
proportion, p
 The CI is based on three elements:
 Point estimate
 SE of the point estimate
 Confidence coefficient

Example 1
 A random sample of 100 people shows that 25
are left-handed. Form a 95% CI for the true
proportion of left-handers.
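A minimal sketch of this calculation, assuming the usual large-sample (normal approximation) interval p ± z·√(p(1−p)/n); the variable names are chosen only for this illustration.

```python
import math
from statistics import NormalDist

x, n = 25, 100                       # left-handers, sample size
p_hat = x / n                        # point estimate
se = math.sqrt(p_hat * (1 - p_hat) / n)   # SE of the point estimate
z = NormalDist().inv_cdf(0.975)      # confidence coefficient for 95%

lower = p_hat - z * se
upper = p_hat + z * se
print(f"95% CI for p: ({lower:.3f}, {upper:.3f})")
```

With these numbers the interval runs from roughly 0.17 to 0.33.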

Interpretation

Hypothesis testing
 A hypothesis usually results from speculation concerning
observed behavior, natural phenomena, or established
theory.
 If the hypothesis is stated in terms of population
parameters such as the mean and variance, it is called a
statistical hypothesis, whereas a procedure that uses
sample data to decide whether to agree or disagree with it
is called a test of the hypothesis.

Types of Hypotheses
 Null hypothesis (represented by H0) is a statement about the value of
one or more population parameters. That is, the null hypothesis postulates
that ‘there is no difference between factor and outcome’ or ‘there is no
intervention effect’.
 Alternative hypothesis (represented by HA) states the ‘opposing’ view
that ‘there is a difference between factor and outcome’ or ‘there is an
intervention effect’.
 The alternative hypothesis is declared to be accepted if the null hypothesis is
rejected.

Steps in Hypothesis Testing
1. Formulate the appropriate statistical hypotheses clearly
• Specify H0 and HA
H0: μ = μ0      H0: μ ≤ μ0      H0: μ ≥ μ0
HA: μ ≠ μ0      HA: μ > μ0      HA: μ < μ0
two-tailed      one-tailed      one-tailed
2. State the assumptions necessary for computing
probabilities
• A distribution is approximately normal (Gaussian)
• Variance is known or unknown
3. Select a sample and collect data
• Categorical, continuous

4. Decide on the appropriate test statistic for the
hypothesis. E.g., One population
5. Specify the desired level of significance for the
statistical test (α = 0.05, 0.01, etc.)

6. Determine the critical value.
 A value the test statistic must attain to be declared significant.
Two-tailed (α = 5%): ±1.96; one-tailed, > (α = 5%): 1.645; one-tailed, < (α = 5%): −1.645
7. Obtain sample evidence and compute the test statistic
8. Reach a decision and draw the conclusion
• If H0 is rejected, we conclude that HA is true (or accepted).
• If H0 is not rejected, we conclude that H0 may be true.

Types of Errors in Hypothesis
Tests
 Whenever we reject or accept H0, we
may commit an error.
 Two types of errors can be committed:
 Type I error: rejecting H0 when it is true
 Type II error: accepting H0 when it is false

Rule of decision making
 The rejection or critical region is the range of values of a
sample statistic that will lead to rejection of the null
hypothesis.
 Obviously we cannot make both types of errors
simultaneously, and in fact we may not make either, but the
possibility does exist.
 In fact, we will usually never know whether any error has been
committed. The only way to avoid any chance of error is not to make a
decision at all.
Type of decision    H0 true                     H0 false
Reject H0           Type I error (α)            Correct decision (1 − β)
Accept H0           Correct decision (1 − α)    Type II error (β)

Test Statistics
 A test statistic is a value we can compare with the
known distribution of what we expect when the null
hypothesis is true.
 The general formula of the test statistic is:
Test statistic = (Observed value − Hypothesized value) / Standard error

The P- Value
 In most applications, the outcome of performing a
hypothesis test is to produce a p-value.
 The p-value is the probability, computed assuming the null
hypothesis is true, of obtaining a difference at least as large as
the one observed purely by chance.
 A small p-value implies that the probability of the value
observed occurring just by chance is low when the null
hypothesis is true.
 That is, a small p-value suggests that there might be
sufficient evidence for rejecting the null hypothesis.
 The p-value is defined as the probability of observing the
computed test statistic value or a larger one, if H0
is true. For example, P[Z ≥ z_cal | H0 true].

P-value……
 A p-value is the probability of getting the observed
difference, or one more extreme, in the sample purely
by chance from a population where the true difference is
zero.
 An “empirical” significance level or indicator of
the weight of evidence against the null
hypothesis.

How to calculate P-value
o Use statistical software like SPSS, SAS, etc.
o Hand calculation:
- obtain the test statistic (z-calculated or t-calculated)
- find the area between 0 and the test statistic in the
standard normal table
- subtract that area from 0.5
- the result is the one-tailed p-value
Note: if the test is two-tailed, multiply the result by 2.
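The hand calculation can be mirrored in Python with the standard normal distribution from the standard library. This is a sketch for z statistics only (the standard library has no t distribution), and the function name `p_value_from_z` is hypothetical.

```python
from statistics import NormalDist

def p_value_from_z(z_cal, two_tailed=True):
    """p-value for a calculated z statistic (illustrative helper)."""
    # Area between 0 and |z|, as read from a standard normal table
    area_0_to_z = NormalDist().cdf(abs(z_cal)) - 0.5
    # Subtract from 0.5 to get the tail area beyond |z|
    one_tail = 0.5 - area_0_to_z
    return 2 * one_tail if two_tailed else one_tail

print(round(p_value_from_z(1.96), 3))                      # two-tailed
print(round(p_value_from_z(1.645, two_tailed=False), 3))   # one-tailed
```

Both calls give a p-value of about 0.05, matching the critical values 1.96 (two-tailed) and 1.645 (one-tailed) at the 5% level.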

The P- Value …..
 But for what values of the p-value should we reject the null
hypothesis?
 By convention, a p-value of 0.05 or smaller is
considered sufficient evidence for rejecting the null
hypothesis.
 By using a cut-off of 0.05, we are allowing a 5%
chance of wrongly rejecting the null hypothesis
when it is in fact true.
 When the p-value is less than 0.05, we often say that
the result is statistically significant.

Hypothesis testing for single population
mean
 EXAMPLE 1: A researcher claims that the mean IQ
of university students is 100, with a standard deviation of 10.
The mean IQ in a sample of 16 students is 110.
Test the hypothesis.
 Solution
1. Ho: µ = 100 vs HA: µ ≠ 100
2. Assume α = 0.05
3. Test statistic: z = (110 − 100)/(10/√16) = 10/2.5 = 4
4. z-critical at α/2 = 0.025 is equal to 1.96.
5. Decision: reject the null hypothesis since 4 ≥ 1.96
6. Conclusion: the mean IQ of the population of university students is different from 100.
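The same steps, sketched in Python with the numbers from the example (the variable names are chosen only for this illustration):

```python
import math
from statistics import NormalDist

mu_0, sigma = 100, 10      # hypothesized mean and known SD
x_bar, n = 110, 16         # sample mean and sample size
alpha = 0.05

se = sigma / math.sqrt(n)              # standard error = 2.5
z_cal = (x_bar - mu_0) / se            # calculated test statistic
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value
p_value = 2 * (1 - NormalDist().cdf(abs(z_cal)))

reject_h0 = abs(z_cal) >= z_crit
print("z =", z_cal, "critical value =", round(z_crit, 2), "reject H0:", reject_h0)
print("two-tailed p-value =", p_value)
```

The calculated z of 4 exceeds the critical value, and the two-tailed p-value is far below 0.05, so both routes lead to the same decision.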

Hypothesis testing for single proportions
 Example: In a study of childhood abuse in psychiatric patients, Brown
found that 166 in a sample of 947 patients reported histories of physical
or sexual abuse.
a) Construct a 95% confidence interval for the population proportion.
b) Test the hypothesis that the true population proportion is
30%.
 Solution (a)
 The 95% CI for P is given by
$p \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} = 0.175 \pm 1.96\sqrt{\frac{0.175 \times 0.825}{947}} = 0.175 \pm 1.96 \times 0.0124 = [0.151;\ 0.2]$

Example……
 To test the hypothesis we need to follow these steps
Step 1: State the hypothesis
Ho: P = Po = 0.3
HA: P ≠ Po
Step 2: Fix the level of significance (α = 0.05)
Step 3: Compute the calculated and tabulated values of the test statistic
$z_{cal} = \frac{p - P_o}{\sqrt{P_o(1-P_o)/n}} = \frac{0.175 - 0.3}{\sqrt{0.3(0.7)/947}} = \frac{-0.125}{0.0149} = -8.39, \qquad z_{tab} = 1.96$

Example……
 Step 4: Comparison of the calculated and tabulated values of
the test statistic
 Since the tabulated value (1.96) is smaller than the absolute value of the
calculated test statistic (8.39), we reject the null hypothesis.
 Step 5: Conclusion
 Hence we conclude that the proportion of childhood abuse
in psychiatric patients is different from 0.3.
 If the sample size is small (np < 5 or n(1 − p) < 5), then use
Student’s t statistic for the tabulated value of the test statistic.
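Both parts of the example can be sketched in Python. The code keeps the unrounded sample proportion 166/947, so the z value comes out near −8.37 rather than the −8.39 obtained by hand with p rounded to 0.175; the variable names are illustrative.

```python
import math
from statistics import NormalDist

x, n = 166, 947
p_hat = x / n                     # sample proportion, about 0.175
p_0 = 0.3                         # hypothesized proportion
z_tab = NormalDist().inv_cdf(0.975)

# (a) 95% CI: the standard error uses the sample proportion
se_ci = math.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - z_tab * se_ci, p_hat + z_tab * se_ci)

# (b) Test of H0: P = 0.3: the standard error uses the hypothesized value
se_test = math.sqrt(p_0 * (1 - p_0) / n)
z_cal = (p_hat - p_0) / se_test
reject_h0 = abs(z_cal) > z_tab

print("95% CI:", tuple(round(v, 3) for v in ci))
print("z_cal =", round(z_cal, 2), "reject H0:", reject_h0)
```

The hypothesized value 0.3 lies well outside the 95% CI, which agrees with rejecting the null hypothesis in part (b).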

Two sample mean and
proportion
 So far we have seen estimates for only a single mean and a
single proportion. However, it is also possible to compute point
and interval estimates for the difference of two
means.
 Let x1, x2, …, xn1 be a sample from the first population and
y1, y2, …, yn2 be a sample from the second population.
 Let the sample mean for the first population be $\bar{X}$
 and the sample mean for the second population be $\bar{Y}$
 Then the point estimate for the difference of means (µ1 − µ2)
is given by $(\bar{X} - \bar{Y})$

Two sample estimation
 A (1 − α)100% confidence interval for the
difference of means, if $\sigma_1^2$ and $\sigma_2^2$ are
known, is given by
$(\bar{x} - \bar{y}) \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$

Hypothesis testing for two sample means
 The steps to test a hypothesis about the difference of means are
the same as for a single mean
Step 1: State the hypothesis
Ho: µ1 − µ2 = 0
VS
HA: µ1 − µ2 ≠ 0, HA: µ1 − µ2 < 0, or HA: µ1 − µ2 > 0
Step 2: Significance level (α)
Step 3: Test statistic
$z_{cal} = \frac{(\bar{x} - \bar{y}) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$

Example
 Researchers wish to know if the data they have collected
provide sufficient evidence to indicate a difference in mean
serum uric acid levels between normal individuals and
individuals with Down’s syndrome. The data consist of serum
uric acid readings on 12 individuals with Down’s syndrome
and 15 normal individuals. The sample means are 4.5 mg/100 ml and
3.4 mg/100 ml, and the population standard deviations are taken to be
2.9 and 3.5 mg/100 ml, respectively.
$H_O: \mu_1 - \mu_2 = 0$
$H_A: \mu_1 - \mu_2 \neq 0$

SOLUTION
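$z_{cal} = \frac{(\bar{x} - \bar{y}) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} = \frac{(4.5 - 3.4) - 0}{\sqrt{\frac{2.9^2}{12} + \frac{3.5^2}{15}}} = \frac{1.1}{\sqrt{1.5175}} = \frac{1.1}{1.23} = 0.89$
$z_{\alpha/2} = z_{0.025} = 1.96$
Since $|z_{cal}| = 0.89 < 1.96$, we fail to reject Ho at the 5% level.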
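As a check on the arithmetic, here is a short Python sketch that plugs the figures stated in the example (means 4.5 and 3.4, SDs 2.9 and 3.5, sample sizes 12 and 15) into the z statistic for two means with known variances. The variable names are illustrative.

```python
import math
from statistics import NormalDist

# Figures as stated in the example
x_bar, sigma_1, n_1 = 4.5, 2.9, 12   # Down's syndrome group
y_bar, sigma_2, n_2 = 3.4, 3.5, 15   # normal group
alpha = 0.05

se_diff = math.sqrt(sigma_1**2 / n_1 + sigma_2**2 / n_2)  # SE of the difference
z_cal = ((x_bar - y_bar) - 0) / se_diff                   # H0: mu1 - mu2 = 0
z_tab = NormalDist().inv_cdf(1 - alpha / 2)

reject_h0 = abs(z_cal) > z_tab
print("SE =", round(se_diff, 2), "z_cal =", round(z_cal, 2), "reject H0:", reject_h0)
```

With these inputs the standard error is about 1.23 and the z statistic is about 0.89, which is below 1.96, so the data as stated do not show a significant difference at the 5% level.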






Estimation and hypothesis testing for two population
proportions
 Let n1 and n2 be the sample sizes from the two populations. If x
and y are the numbers with the outcome of interest, then the point estimate for
each population proportion is given by p1 = x/n1 and p2 = y/n2 respectively.
 The point estimate of π1 − π2 is p1 − p2
 The interval estimate for the difference of proportions,
if the sample sizes are large and n1p1 > 5, n1(1 − p1) > 5, n2p2 > 5,
n2(1 − p2) > 5, is given by
$(p_1 - p_2) \pm z_{\alpha/2}\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$

Hypothesis testing for two proportions
 To test the hypothesis
Ho: π1 − π2 = 0
VS
HA: π1 − π2 ≠ 0
The test statistic is given by
$z_{cal} = \frac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}}$


Summary
 Students sometimes have difficulty deciding whether to
use $z_{\alpha/2}$ or $t_{\alpha/2}$ values when finding confidence
intervals

t-test
One sample t-test:
 It is used to compare the estimate of a sample with a
hypothesized population mean to see if the sample
is significantly different.
 Assumptions which should be fulfilled before we use
this method:
 The dependent variable is normally distributed within the
population
 The data are independent (scores of one participant are not
dependent on scores of the other)

T-test cont…
 Hypothesis: Ho: μ = μo vs HA: μ ≠ μo,
where μo is the hypothesized mean value
The test statistic is: tcalc = (x̄ − μo)/(s/√n)
 We compare the calculated test statistic (tcalc) with the
tabulated value (ttab) at n − 1 degrees of freedom

T-test cont…
E.g. Data: The distance covered by marathon runners until a physiological
stress develops, and whether they used a drug or not
No   Distance in miles   Drug use     No   Distance in miles   Drug use
1    14.5                no           10   18.4                yes
2    13.4                no           11   16.9                yes
3    14.8                yes          12   12.6                no
4    19.5                yes          13   13.4                no
5    14.5                no           14   16.3                yes
6    18.2                yes          15   17.1                yes
7    16.3                no           16   11.8                no
8    14.8                no           17   13.3                yes
9    20.3                yes          18   14.5                no
Mean: 15.59
Standard deviation: 2.43

T-test cont..
It is believed that the mean distance covered
before feeling physiological stress is 15 miles
Hypotheses: Ho: μ = 15 versus HA: μ ≠ 15
Level of significance: α = 5%
x̄ = 15.59, s = 2.43, n = 18
tcalc = (x̄ − μ)/(s/√n) = (15.59 − 15)/0.57
= 1.03, and p-value = 0.318
At 17 degrees of freedom and α = 0.05, ttab = 2.110
Since tcalc = 1.03 < 2.110 = ttab, or α = 0.05 < 0.318 = p-value,
we fail to reject Ho
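The calculation can be reproduced from the 18 distances in the data table. This sketch uses only Python's standard library, which has no t distribution, so the tabulated value 2.110 is taken from the slide rather than computed.

```python
import math
from statistics import mean, stdev

# Distances (miles) for runners 1..18, from the data table
distances = [14.5, 13.4, 14.8, 19.5, 14.5, 18.2, 16.3, 14.8, 20.3,
             18.4, 16.9, 12.6, 13.4, 16.3, 17.1, 11.8, 13.3, 14.5]
mu_0 = 15          # hypothesized mean
t_tab = 2.110      # tabulated t at 17 df, alpha = 0.05 (from the slide)

n = len(distances)
x_bar = mean(distances)
s = stdev(distances)                 # sample SD (n - 1 in the denominator)
t_calc = (x_bar - mu_0) / (s / math.sqrt(n))

reject_h0 = abs(t_calc) > t_tab
print(f"mean = {x_bar:.2f}, s = {s:.2f}, t = {t_calc:.2f}, reject H0: {reject_h0}")
```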

Two sample t- test
 A t-distribution can be used for testing hypotheses
about differences of means for independent samples if
both populations are normal and have the same
variances.
 Assesses whether the means of two samples are
statistically different from each other. This analysis is
appropriate whenever you want to compare the means
of two samples/ conditions
 Assumptions of a t-test:
 from a parametric population
 not (seriously) skewed
 no outliers

t-tests….
 Compare the mean between 2 samples/conditions
 if 2 samples are taken from the same population,
then they should have fairly similar means
 if 2 means are statistically different, then the
samples are likely to be drawn from 2 different
populations, i.e. they really are different
[Figure: 95% CI of "infer comp" by lesion site (right hemisphere, left hemisphere); panels Exp. 1 and Exp. 2]

T-test cont..
b. Paired t- test
 Each observation in one sample has one and only
one mate in the other sample, so the two samples are
dependent on each other.
 For example, the independent variable can be
measurements like:
before and after (e.g before and after an intervention),
or repeated measurement (e.g. using digital and
analog apparatus), or when the two data sources are
dependent (e.g. data from mother and father of
respondent)
Hypothesis: Ho: μd = 0 Vs HA: μd ≠ 0

T-test cont..
Example: The blood pressure (BP) of 10 mothers was
measured before and after taking a new drug.
Subject   BP before   BP after   Difference (di)
1         130         110        -20
2         125         130        +5
3         140         120        -20
4         150         130        -20
5         120         110        -10
6         130         130        0
7         120         115        -5
8         135         130        -5
9         140         130        -10
10        130         120        -10
d̄ (mean of d) = -9.5
Sd (standard deviation of d) = 8.64

T-test cont..
Hypothesis: Ho: μd = 0 versus HA: μd ≠ 0
Set the level of significance: α = 0.05
d̄ = −9.5, Sd = 8.64, n = 10
tcalc = (d̄ − μd)/(sd/√n) = −9.5/(8.64/√10) = −3.48, and p-value = 0.0075
At n − 1 = 9 df and α = 0.05, ttab = 2.26
Since ttab = 2.26 < 3.48 = |tcalc|, or p-value = 0.0075 < 0.05 = α,
we reject Ho
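A sketch of the paired calculation from the before and after readings in the table. The tabulated value 2.26 is taken from the slide, since the standard library does not provide the t distribution.

```python
import math
from statistics import mean, stdev

bp_before = [130, 125, 140, 150, 120, 130, 120, 135, 140, 130]
bp_after = [110, 130, 120, 130, 110, 130, 115, 130, 130, 120]
t_tab = 2.26       # tabulated t at 9 df, alpha = 0.05 (from the slide)

# Work on the within-subject differences (after - before)
d = [after - before for before, after in zip(bp_before, bp_after)]
n = len(d)
d_bar = mean(d)
s_d = stdev(d)
t_calc = (d_bar - 0) / (s_d / math.sqrt(n))   # H0: mu_d = 0

reject_h0 = abs(t_calc) > t_tab
print(f"d_bar = {d_bar}, s_d = {s_d:.2f}, t = {t_calc:.2f}, reject H0: {reject_h0}")
```

The sign of t is negative because the differences are taken as after minus before; the decision depends only on its absolute value.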

T-test cont..
c. Two independent samples t-test
 Used to compare two unrelated or independent groups
 Assumptions include:
 The variance of the dependent variable in the two
populations are equal
 The dependent variable is normally distributed within
each population
 The data are independent (scores of one participant
are not related systematically to the scores of the
others)
 Hypothesis: Ho: μt = μc Vs HA: μt ≠ μc ,
Where μt and μc are the population mean of treatment
and control (placebo) groups respectively.

The test statistics for two sample T-test cont….
 There are three cases which depend on what is known
about the population variances.
Case 1:
 Population variances $\sigma_1^2$ and $\sigma_2^2$ are known for normal
populations (or non-normal populations with both $n_1$
and $n_2$ large). In this case the test statistic is:
$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$

Case 2:
 Population variances are unknown but assumed to be equal,
$\sigma_1^2 = \sigma_2^2 = \sigma^2$, in normal populations. In this case, we pool our
estimates to get the pooled two-sample variance
$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$
 for the unknown common variance $\sigma^2$
 And the test statistic is
$T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$
 which has a $t_{n_1 + n_2 - 2}$ distribution if $H_0$ is true.

 Case 3:
 $\sigma_1^2$ and $\sigma_2^2$ are unknown and unequal in
normal populations. In this case the test
statistic is given by:
$T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}$
which does not have a known exact distribution. If both n1 and n2
are large (both over 30) we can assume a normal
distribution.

Example
Do the marathon runners grouped by their drug intake status
differ in their average distance covered before they feel
any physiological stress?
Hypothesis: Ho: μt = μc vs HA: μt ≠ μc, where μt and μc are
the means for drug users and non-users respectively
Set the level of significance, α = 5%
x̄c = 13.98, sc = 1.33, x̄t = 17.20, st = 2.21, nc = nt = 9
tcalc = (x̄c − x̄t)/√(S²(1/nc + 1/nt)) = −3.741, and its p-value = 0.002
S² is the pooled (combined) variance of both groups.
At 16 df and α = 0.05, ttab = ±2.12
Since tcalc = −3.741 < −2.12, or p-value = 0.002 < 0.05 = α,
we reject Ho
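The pooled-variance calculation can be sketched from the marathon data table by splitting the 18 distances according to drug use. The tabulated value 2.12 is taken from the slide; the variable names are illustrative.

```python
import math
from statistics import mean, variance

# Distances (miles) split by drug use, from the data table
users = [14.8, 19.5, 18.2, 20.3, 18.4, 16.9, 16.3, 17.1, 13.3]
non_users = [14.5, 13.4, 14.5, 16.3, 14.8, 12.6, 13.4, 11.8, 14.5]
t_tab = 2.12       # tabulated t at 16 df, alpha = 0.05 (from the slide)

n_t, n_c = len(users), len(non_users)
x_t, x_c = mean(users), mean(non_users)
var_t, var_c = variance(users), variance(non_users)   # sample variances

# Pooled variance (equal-variance assumption, Case 2)
df = n_t + n_c - 2
s2_pooled = ((n_c - 1) * var_c + (n_t - 1) * var_t) / df

t_calc = (x_c - x_t) / math.sqrt(s2_pooled * (1 / n_c + 1 / n_t))
reject_h0 = abs(t_calc) > t_tab
print(f"x_c = {x_c:.2f}, x_t = {x_t:.2f}, df = {df}, t = {t_calc:.3f}, reject H0: {reject_h0}")
```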

T-test cont…
 Here in the case of two independent sample t-test,
we have one continuous dependent variable
(interval/ratio data) and;
 one nominal or ordinal independent variable with
only two categories
 In this last case (i.e. two
independent sample t-test), what
if there are more than two
categories for the independent
variable we have?

Inferences for Two or More Means
 Are the birth weights of children in different
geographical regions the same?
 Are the responses of patients to different medications
and placebo different?
 Do people in different age groups have different
proportions of body fat?
 Do people from different ethnicity have the same BMI?

Assignment
 Prevention of Violations of assumption
 Detection of Violations of assumptions
 goodness-of-fit tests
