2_Lecture 2_Confidence_Interval_3.pdf

Confidence intervals
and Sample Size
Assist.Prof. TOL BUNKEA, MD,MSc-Epidemiology
(London School of Hygiene and Tropical Medicine, UK)
Head of Epidemiology Unit of National Centre for Parasitology, Entomology and
Malarial Control (CNM, MoH)
Lecturer of Epidemiology and Biostatistics (NIPH, UHS, UP)
Tel: 016 690 999
Email: tolbunkea@ymail.com,

Objectives:
At the end of this session you will be able to:
1. Distinguish between sample statistic and population parameter
2. Find the confidence interval for the mean when σ is unknown
3. Interpret confidence interval in the estimation of the population
parameter.
4. Describe factors influencing the width of the confidence interval
5. Explain reasons for using confidence interval

Outline
① Population and Samples
② Point Estimates and Confidence Intervals
③ Effect of sample size on Confidence Intervals
④ Confidence interval for mean and proportion

Introduction
Statistical inference is the process by which we
draw conclusions about a population from data
collected on a sample.

Population and Samples
• Census
–Everyone in population
• Eg. All Cambodian residents
• Population
–is a set of persons (or objects) having a
common observable characteristic.
–the entire collection of units about
which we would like information.

• Sample
–is a representative subset (subgroup) of a
population.
–the collection of units we actually measure
• Example:
–If we want to know how many persons in a
community
• have quit smoking or
• have health insurance or
• plan to vote for a certain candidate,

• In medical research
–Population
• All patients who are candidates for treatment
–Sample
• All patients who are candidates for treatment and who
volunteer for your study
• Infer results from volunteers (sample) to other
candidates for the same treatment (population).
–We usually obtain information on an
appropriate sample of the community and
generalize from it to the entire population.

• The way the sample is selected, not its size,
determines whether we may draw
appropriate inferences about a population.
• The primary reason for selecting a sample
from a population is to draw inferences about
that population.
• Statistical inference is the process by which
we infer population properties from sample
properties.
• A major component: Parameter and Statistic

Population and Samples
Parameters
➢ Are fixed values “truth”, but we rarely
know them because it is often difficult
to obtain measures from the entire
population.
➢ The true value we hope to obtain.

Statistics
➢ Are known values computed from a sample; they
are random variables because they differ from
sample to sample.
➢ An estimate of the parameter based on observed
information in the sample
➢ Statistics have associated error
• The study of statistics is about estimating that
error
• Central limit theorem tells us how much error to
expect in our sample estimates (i.e. sample
statistics)
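The idea that a statistic differs from sample to sample can be sketched with a short simulation. Everything here is an assumption for illustration: the population of incomes is made up, and the sample size of 200 and the 500 repeats are arbitrary choices.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100,000 made-up annual incomes (arbitrary units).
population = [rng.gauss(5000, 1200) for _ in range(100_000)]
mu = statistics.fmean(population)       # parameter: fixed, normally unknown
sigma = statistics.pstdev(population)   # population standard deviation

n = 200
# Each random sample gives a different value of the statistic (the sample mean).
sample_means = [statistics.fmean(rng.sample(population, n)) for _ in range(500)]

observed_spread = statistics.stdev(sample_means)  # how much the statistic varies
expected_spread = sigma / n ** 0.5                # standard error, sigma / sqrt(n)

print(f"parameter mu             = {mu:.1f}")
print(f"first five sample means  = {[round(m, 1) for m in sample_means[:5]]}")
print(f"observed spread of means = {observed_spread:.1f}")
print(f"sigma / sqrt(n)          = {expected_spread:.1f}")
```

The parameter stays fixed while the sample means move around it, and their spread should be close to the standard error σ/√n.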

Example
• A study is conducted to estimate the true mean
annual income of all adult residents of Cambodia.
• The study randomly selects 2000 adult residents of
Cambodia. What are the population? Sample? Parameter? Statistic?
• The population consists of all adult residents of
Cambodia.
• The sample is the 2000 residents in the study.
• The parameter is the true mean annual income of
all adult residents of Cambodia.
• The statistic is the mean of the 2000 residents in
this sample.

Example
• A survey is carried out at a university to estimate the
proportion of undergraduate students who drive to campus to
attend classes.
• One thousand students are randomly selected and asked
whether they drive or not to campus to attend classes.
– What are the population? Sample? Parameter? Statistic?
• The population is all of the undergraduates at that university.
• The sample is the group of 1000 undergraduate students
surveyed.
• The parameter is the true proportion of all undergraduate
students at that university who drive to campus to attend
classes.
• The statistic is the proportion of the 1000 sampled
undergraduates who drive to campus to attend classes.

Statistical methods to make inferences about the
population from the sample

One aspect of inferential statistics is estimation,
which is the process of estimating the value of a
parameter from information obtained from a
sample.
Point and Interval Estimates

• Since the populations from which these values
were obtained are large, these values are only
estimates of the true parameters and are
derived from data collected from samples.
• The statistical procedures for estimating the
population mean, proportion, variance, and
standard deviation will be explained.
• An important question in estimation is that of
sample size.

• How large should the sample be in order to
make an accurate estimate?
• This question is not easy to answer since the
size of the sample depends on several factors,
such as the accuracy desired and the
probability of making a correct estimate.
• The question of sample size will be explained.

Confidence Intervals for the Mean
When σ Is Known and Sample Size

• Suppose a college president wishes to estimate the
average age of students attending classes this semester.
• The president could select a random sample of 100
students and find the average age of these students, say,
27.3 years.
• From the sample mean, the president could infer that the
average age of all the students is 27.3 years.
• This type of estimate is called a point estimate.

• You might ask why other measures of central
tendency, such as the median and mode, are not
used to estimate the population mean.
• The reason is that the means of samples vary less
than other statistics (such as medians and modes)
when many samples are selected from the same
population.
• Therefore, the sample mean is the best estimate of
the population mean.
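The claim that sample means vary less than sample medians can be checked with a small simulation. This is a sketch under an assumption: the population is taken to be normal, with a made-up mean of 22 and SD of 4. For heavy-tailed populations the median can be the more stable of the two.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed for reproducibility

def draw_sample(n):
    """One random sample of size n from a hypothetical normal population."""
    return [rng.gauss(22.0, 4.0) for _ in range(n)]

samples = [draw_sample(25) for _ in range(2000)]
means = [statistics.fmean(s) for s in samples]
medians = [statistics.median(s) for s in samples]

# Spread of each statistic across the 2000 repeated samples.
sd_of_means = statistics.stdev(means)
sd_of_medians = statistics.stdev(medians)
print(f"spread of sample means   = {sd_of_means:.3f}")
print(f"spread of sample medians = {sd_of_medians:.3f}")
```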

• Sample measures (i.e., statistics) are used to
estimate population measures (i.e., parameters).
• These statistics are called estimators.
• As previously stated, the sample mean is a better
estimator of the population mean than the sample
median or sample mode.
• A good estimator should satisfy the three properties
described now.

Confidence Intervals
• The sample mean will be, for the most part,
somewhat different from the population
mean due to sampling error.
• Therefore, you might ask a second question:
How good is a point estimate?
• The answer is that there is no way of knowing
how close a particular point estimate is to the
population mean.

• This answer places some doubt on the
accuracy of point estimates.
• For this reason, statisticians prefer another
type of estimate, called an interval estimate.

• In an interval estimate, the parameter is specified as being
between two values.
• For example, an interval estimate for the average age of all
students might be
26.9 < µ < 27.7, or
27.3 ± 0.4 years.
• Either the interval contains the parameter or it does not.
• A degree of confidence (usually a percent) can be assigned
before an interval estimate is made.
• For instance, you may wish to be 95% confident that the
interval contains the true population mean.
• Another question then arises.
Why 95%? Why not 99 or 99.5%?

• If you desire to be more confident, such as 99 or
99.5% confident, then you must make the interval
larger.
• For example, a 99% confidence interval for the mean
age of college students might be
26.7 < µ < 27.9, or
27.3 ± 0.6.
• Hence, a tradeoff occurs.
• To be more confident that the interval contains the
true population mean, you must make the interval
wider.

• Intervals constructed in this way are called
confidence intervals.
• Three common confidence intervals are used:
the 90, the 95, and the 99% confidence intervals.

The central limit theorem states that when the sample
size is large, approximately 95% of the sample means
of samples of the same size taken from a population will fall
within ± 1.96 standard errors of the population mean,
that is, within
$\mu \pm 1.96\,\dfrac{\sigma}{\sqrt{n}}$

Hence, you can be 95% confident that the population
mean is contained within that interval when the values
of the variable are normally distributed in the
population.
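The step from the central limit theorem to the confidence interval is a rearrangement of the same inequality. It is written out below, with σ the population standard deviation, n the sample size and x̄ the sample mean: saying that x̄ lies within 1.96 standard errors of µ is the same event as saying that µ lies within 1.96 standard errors of x̄.

```latex
\begin{align*}
P\!\left(\mu - 1.96\,\frac{\sigma}{\sqrt{n}} \le \bar{x} \le \mu + 1.96\,\frac{\sigma}{\sqrt{n}}\right) &\approx 0.95 \\
\Longleftrightarrow\quad
P\!\left(\bar{x} - 1.96\,\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + 1.96\,\frac{\sigma}{\sqrt{n}}\right) &\approx 0.95
\end{align*}
```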

A point estimate is a single number, computed from
sample information, that is used to estimate
the population parameter.

• Estimation is one of the main purposes of
statistics.
• The basic idea is that we take a sample of data and
use it to make inferences about the population of
interest.
• Interval estimation involves the calculation of a
confidence interval around some statistic (for example, a
proportion or an average)

Point and Interval Estimates
• A confidence interval estimate is a range of values
constructed from sample data so that the population
parameter is likely to occur within that range at a
specified probability.
• A range of values within which, we believe, the true
parameter lies with high probability.
• The specified probability is called the level of
confidence.
• Point estimation is a form of statistical inference.
• In point estimation we use the data from the sample
to compute a value of a sample statistic that serves
as an estimate of a population parameter.

Example:
• A random sample of the treatment costs of 32 patients
is taken from a local hospital. Find a point
estimate for the population mean µ.
• The point estimate for the population mean µ
of treatment cost is the sample mean, $74.22.

Confidence Intervals
• Describes the precision of the estimate.
• The CI represents a range of values on either
side of the estimate.
• The narrower the CI, the more precise the
point estimate.

Example
• A large bag of 500 red, green and blue marbles:
– You want to know the percentage of green marbles
but don’t want to count every marble.
– Shake up the bag and select 50 marbles to give an
estimate of the percentage of green marbles
• Sample of 50 marbles:
– 15 green marbles, 10 red marbles, 25 blue marbles
– Based on the sample we conclude that 30% (15 out of 50) of the marbles
are green
– 30% = point estimate

Example
• How confident are we in this estimate?
– Actual percentage of green marbles could be higher
or lower, ie. sample of 50 may not reflect
distribution in entire bag of marbles
• Can calculate a confidence interval to
determine the degree of uncertainty.
• How do you calculate a confidence interval?
• Can do so by hand or use a statistical program
– Epi Info, SAS, STATA, SPSS and Episheet are common
statistical programs

• Most commonly used confidence interval is the 95%
interval
➢ A 95% CI means we can be 95% confident that our
estimated range contains the true population value
• Assume that the 95% CI for our bag of marbles
example is 17-43%
• We estimated that 30% of the marbles are green:
➢ CI tells us that the true percentage of green marbles
is most likely between 17 and 43%
➢ Intervals built this way miss the true percentage in about
5% of samples, so this range (17-43%) may not contain it

• If we want less chance of error we could
calculate a 99% confidence interval
➢ A 99% CI will have only a 1% chance of error but
will have a wider range
➢ 99% CI for green marbles is 13-47%
• If a higher chance of error is acceptable we
could calculate a 90% confidence interval
➢ 90% CI for green marbles is 19-41%
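The three marble intervals can be reproduced with the normal-approximation (Wald) formula for a proportion. This is a sketch: the slides do not state which method was used, so the choice of the Wald formula and the function name `proportion_ci` are assumptions.

```python
import math

def proportion_ci(successes, n, z):
    """Normal-approximation (Wald) interval: p +/- z * sqrt(p * (1 - p) / n)."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

# z multipliers as rounded in these slides: 1.65 (90%), 1.96 (95%), 2.58 (99%).
intervals = {}
for level, z in [(90, 1.65), (95, 1.96), (99, 2.58)]:
    low, high = proportion_ci(15, 50, z)   # 15 green marbles out of 50
    intervals[level] = (round(low * 100), round(high * 100))
    print(f"{level}% CI: {intervals[level][0]}% to {intervals[level][1]}%")
```

The point estimate stays at 30% in all three cases. Only the multiplier changes, which is why more confidence costs a wider interval.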

• Very narrow CIs indicate a very precise estimate.
• Can get a more precise estimate by taking a larger
sample
➢ 100 marble sample with 33 green marbles
• Point estimate is 33%
• 95% confidence interval is 24-42% (rather than 17-43% for
original sample)
➢ 200 marble sample with 56 green marbles
• Point estimate is 28%
• 95% confidence interval is 22-34%
• CI becomes narrower as the sample size
increases
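To isolate the effect of sample size, the sketch below holds the sample proportion at a hypothetical 30% and changes only n. The values of n are illustrative, and the normal-approximation formula for a proportion is assumed.

```python
import math

p_hat = 0.30   # hypothetical: sample proportion held fixed so only n changes
z = 1.96       # multiplier for a 95% interval

widths = {}
for n in (50, 100, 200, 800):
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    widths[n] = 2 * half_width
    print(f"n = {n:3d}: {100 * (p_hat - half_width):.0f}% to "
          f"{100 * (p_hat + half_width):.0f}%  (width {100 * widths[n]:.1f} points)")
```

Quadrupling the sample size from 50 to 200 halves the width, because the standard error shrinks with the square root of n.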

Formula of Confidence Intervals
• 95% CI for a mean & proportion
x̅ ± 1.96 × SE(x̅)
• 95% CI for a rate, rate ratio, SMR or odds ratio
Rate ÷/× Error factor
Rate ratio ÷/× Error factor
Odds ratio ÷/× Error factor
SMR ÷/× Error factor
(lower limit = estimate ÷ error factor; upper limit = estimate × error factor)
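The slide does not define the error factor. A common definition, assumed here, is EF = exp(1.96 × SE of the log estimate), with SE(log rate) = 1/√(number of events) and SE(log odds ratio) = √(1/a + 1/b + 1/c + 1/d). The counts below are hypothetical, and the function name `error_factor_ci` is only for illustration.

```python
import math

def error_factor_ci(estimate, se_log, z=1.96):
    """CI for a ratio-type measure: estimate / EF to estimate * EF,
    where EF = exp(z * SE of the log estimate)."""
    ef = math.exp(z * se_log)
    return estimate / ef, estimate * ef

# Hypothetical rate: 30 events in 1,000 person-years.
events, person_years = 30, 1000
rate = events / person_years
rate_low, rate_high = error_factor_ci(rate, 1 / math.sqrt(events))

# Hypothetical 2x2 table for an odds ratio:
# a, b = exposed cases / non-cases; c, d = unexposed cases / non-cases.
a, b, c, d = 20, 80, 10, 90
odds_ratio = (a * d) / (b * c)
or_low, or_high = error_factor_ci(odds_ratio, math.sqrt(1/a + 1/b + 1/c + 1/d))

print(f"rate = {rate:.3f}, 95% CI {rate_low:.3f} to {rate_high:.3f}")
print(f"OR   = {odds_ratio:.2f}, 95% CI {or_low:.2f} to {or_high:.2f}")
```

Because the limits are obtained by dividing and multiplying, the interval is symmetric on the log scale rather than on the original scale.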

• Statisticians can calculate a range (interval) in which we can be
fairly sure (confident) that the “true value” lies.
– For example, we may be interested in blood pressure (BP)
reduction with antihypertensive treatment.
– From a sample of treated patients we can work out the
mean change in BP.
• However, this will only be the mean for our particular sample.
• If we took another group of patients we would not expect to
get exactly the same value, because chance can also affect the
change in BP.
• The CI gives the range in which the true value (i.e. the mean
change in BP if we treated an infinite number of patients) is
likely to be.

Interpretation of CI
• We can be 95% confident that the true mean
cholesterol of the population (parameter) lies within
the interval 194.3 to 198.7.
• We are 95% confident that the true mean
cholesterol of the population (parameter) is between
194.3 and 198.7
• The interpretation of CI always relates to a
parameter, and never a statistic.

What precisely do we mean by 95% confident?
• Suppose we were to repeatedly sample from the
population, and calculate a 95% CI for each sample.
• About 95% of those 95% CIs would capture the true value of
the population parameter.
• Suppose we take a random sample of 10 students
from a high school and obtain their score of Math
Exam.
• These 10 students had a mean of 13 with a
corresponding 95% CI (11, 15)

Interpretation of CI
• We can be 95% confident that the population
mean Math score for students in this school
lies between 11 and 15.
• In repeated sampling, 95% of the 95% CIs
calculated in this manner would capture the
true mean Math score of students in this
school.
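The repeated-sampling interpretation can be illustrated by simulation. This is a sketch with assumptions: the population of Math scores is taken to be normal with a made-up true mean of 13 and a known SD of 3, and each sample has 10 students.

```python
import random
import statistics

rng = random.Random(2024)   # fixed seed for reproducibility

# Hypothetical population of Math scores: normal with known mean and SD.
TRUE_MEAN, SIGMA, N = 13.0, 3.0, 10
half_width = 1.96 * SIGMA / N ** 0.5   # 95% margin when sigma is known

repeats = 5000
captured = 0
for _ in range(repeats):
    sample = [rng.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    x_bar = statistics.fmean(sample)
    # Does this sample's interval contain the fixed parameter?
    if x_bar - half_width <= TRUE_MEAN <= x_bar + half_width:
        captured += 1

coverage = captured / repeats
print(f"{coverage:.1%} of the {repeats} intervals captured the true mean")
```

The parameter never moves. What varies from sample to sample is the interval, and roughly 95% of those intervals should contain the true mean.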

Wrong Interpretation of CI
• 95% of students in this school have Math score
that lie between 11 and 15.
• We can be 95% confident that the sample
mean Math score of the 10 students lies
between 11 and 15.
• In repeated sampling, 95% of the intervals will
capture the sample mean.

Example
• What is the complication rate of thoracoscopy at GHS? How to
interpret?
• Using 3 years of data from GHS there were 52 patients who
had a thoracoscopy; of these, 4 patients had a complication,
a complication rate of 7.7% (95% CI = 2.5%, 17.5%).
Interpretation:
• Based on our sample data, we are 95% confident that the
"true" complication rate at GHS is between 2.5% and 17.5%.
• Another interpretation:
– if we were to take 100 additional samples and calculate a
95% CI from each, about 95 of the 100 intervals would be
expected to contain the "true" complication rate.

Example
• The statistics professors at a university want
to estimate the average statistics anxiety
score for all of their undergraduate students.
• It would be too time consuming and costly to
give every undergraduate student at the
university their statistics anxiety survey.
• Instead, they take a random sample of 50
undergraduate students at the university and
administer their survey.

Example cont.
• Using the data collected from the sample, they
construct a 95% confidence interval for the mean
statistics anxiety score in the population of all
university undergraduate students.
• They are using x̅ to estimate μ.
• If the 95% confidence interval for μ is 26 to 32, then
we could say,
“we are 95% confident that the mean statistics anxiety
score of undergraduate students at this university is
between 26 and 32.”
• In other words, we are 95% confident that
26≤μ≤32. This may also be written as
[29, 95% CI: 26,32] or [29, 95% CI: 26 to 32]

• A range computed using sample statistics to
estimate an unknown population parameter
with a given level of confidence.
• A range (or interval) of values used to
estimate the true value of a population
parameter.

Factors Affecting Confidence Interval Estimates
The factors that determine the width of a
confidence interval are:
1. The sample size, n.
2. The variability in the population,
usually σ estimated by s.
3. The desired level of confidence.
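Each of the three factors can be varied in turn to see its effect on the width. This is a minimal sketch for the interval x̄ ± z·s/√n. The baseline values (n = 100, s = 10) are hypothetical, and 1.96 and 2.58 are the 95% and 99% multipliers.

```python
import math

def ci_width(n, sd, z):
    """Full width of the interval x_bar +/- z * sd / sqrt(n)."""
    return 2 * z * sd / math.sqrt(n)

baseline = ci_width(n=100, sd=10, z=1.96)        # hypothetical reference case
larger_n = ci_width(n=400, sd=10, z=1.96)        # 1. bigger sample -> narrower
more_spread = ci_width(n=100, sd=20, z=1.96)     # 2. more variability -> wider
more_confident = ci_width(n=100, sd=10, z=2.58)  # 3. 99% instead of 95% -> wider

for label, width in [("baseline", baseline), ("n = 400", larger_n),
                     ("sd = 20", more_spread), ("99% level", more_confident)]:
    print(f"{label:10s} width = {width:.2f}")
```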

Sample Size
• Sample size determination is closely related to
statistical estimation.
• Quite often, you ask, How large a sample is
necessary to make an accurate estimate?
• The answer is not simple, since it depends on three
things:
1. the maximum error of the estimate,
2. the population standard deviation, and
3. the degree of confidence.

Sample Size
• For example, how close to the true mean do you
want to be (2 units, 5 units, etc.), and how confident
do you wish to be (90, 95, 99%, etc.)?
• For the purpose of this chapter, it will be assumed
that the population standard deviation of the
variable is known or has been estimated from a
previous study.
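The three ingredients combine into a sample size by rearranging the margin of error E = z·σ/√n into n = (z·σ/E)², rounded up to the next whole number. The slides do not state this formula, so treat it as the standard rearrangement rather than the lecturer's own. The value σ = 3 below is hypothetical.

```python
import math

def sample_size_for_mean(z, sigma, max_error):
    """Smallest n with z * sigma / sqrt(n) <= max_error,
    i.e. n = (z * sigma / E)^2 rounded up."""
    return math.ceil((z * sigma / max_error) ** 2)

# Hypothetical input: sigma = 3 units, taken from a previous study.
n_95 = sample_size_for_mean(z=1.96, sigma=3, max_error=1)       # within 1 unit, 95%
n_99 = sample_size_for_mean(z=2.58, sigma=3, max_error=1)       # within 1 unit, 99%
n_tight = sample_size_for_mean(z=1.96, sigma=3, max_error=0.5)  # within 0.5 unit, 95%
print(n_95, n_99, n_tight)
```

Asking for more confidence or a smaller maximum error both push the required sample size up. Halving the allowed error roughly quadruples n.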

Example
[0.76, 95% CI: 0.701, 0.819] or [0.76, 95% CI: 0.701 to 0.819]

• Thus, a 95% confidence interval for a mean is
calculated as follows:
$\bar{x} \pm 1.96 \times SE(\bar{x})$
• If we took thousands of samples, and for each
sample calculated the mean and associated 95%
confidence interval, we would expect 95% of
these confidence intervals to include the
population mean.
Confidence Interval for a Mean

Exercise
• The interpretation of the confidence interval
in this statement is (B)
Confidence Interval for a Mean

• Sometimes we may wish to use other
confidence intervals such as 90% or 99%
confidence intervals.
• For a 99% confidence interval the value 1.96
used in the formula for a 95% confidence
interval becomes 2.58.
• For a 90% confidence interval the value 1.96
in the formula used previously becomes
1.65.
Confidence Interval for a Mean
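The multipliers come from the standard normal distribution: for a given confidence level, z is the value that leaves half of the remaining probability in each tail. A minimal sketch using Python's standard library follows. The function name `z_multiplier` is only for illustration.

```python
from statistics import NormalDist

def z_multiplier(confidence):
    """Two-sided critical value: the z with `confidence` of the standard
    normal distribution between -z and +z."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

multipliers = {level: z_multiplier(level) for level in (0.90, 0.95, 0.99)}
for level, z in multipliers.items():
    print(f"{level:.0%} confidence: z = {z:.3f}")
```

To three decimals the values are 1.645, 1.960 and 2.576. The 1.65 and 2.58 quoted above are these values as rounded in common tables.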
