Confidence Interval Module
One of the key concepts of statistics enabling statisticians to
make incredibly accurate predictions is called the Central Limit
Theorem. The Central Limit Theorem is defined in this way:
· For samples of a sufficiently large size, the sampling
distribution of means is almost always approximately normal.
· The distribution of means gets closer and closer to normal as
the sample size gets larger and larger, regardless of what the
original variable looks like (positively or negatively skewed).
· In other words, the original variable does not have to be
normally distributed.
· This is because if we, as eccentric researchers, drew an almost
infinite number of random samples from a single population
(such as the student body of NMSU), the means calculated from
the many samples of that population would be approximately
normally distributed, and the mean calculated from all of those
sample means would be a very close approximation to the true
population mean. It is this very characteristic that makes it
possible for us, using sound probability-based sampling
techniques, to make highly accurate statements about
characteristics of a population based upon the statistics
calculated on a sample drawn from that population.
· Furthermore, we can calculate a statistic known as the
standard error of the mean (abbreviated s.e.) that describes the
variability of the distribution of all possible sample means in
the same way that we used the standard deviation to describe
the variability of a single sample. We will use the standard
error of the mean (s.e.) to calculate the statistic that is the topic
of this module, the confidence interval.
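The behavior described in the points above can be checked with a small simulation. The sketch below is only an illustration: the population (an exponential distribution with a mean of 10), the sample size of 100, and the 5,000 repeated samples are all assumptions chosen for the demonstration, not values from this module.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: exponential with mean 10 (strongly positively skewed).
POP_MEAN = 10.0
POP_SD = 10.0  # for an exponential distribution the SD equals the mean

N = 100             # size of each random sample
NUM_SAMPLES = 5000  # how many repeated samples we draw


def sample_mean(n):
    """Mean of one random sample of size n from the skewed population."""
    return statistics.fmean(random.expovariate(1 / POP_MEAN) for _ in range(n))


means = [sample_mean(N) for _ in range(NUM_SAMPLES)]

mean_of_means = statistics.fmean(means)  # should sit close to POP_MEAN
sd_of_means = statistics.pstdev(means)   # should sit close to POP_SD / sqrt(N)

# If the sample means are roughly normal, about 95% of them fall within
# 1.96 standard errors of the population mean.
limit = 1.96 * POP_SD / N ** 0.5
share_within = sum(abs(m - POP_MEAN) <= limit for m in means) / NUM_SAMPLES

print("mean of sample means:", round(mean_of_means, 2))
print("SD of sample means:  ", round(sd_of_means, 2))
print("share within 1.96 standard errors:", round(share_within, 3))
```

Even though the original variable is far from normal, the sample means cluster around the population mean, and their spread is what the standard error describes.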
The formula that we use to calculate the standard error of the
mean is:
s.e. = s / √(N – 1)
where s = the standard deviation calculated from the sample;
and
N = the sample size.
So the formula tells us that the standard error of the mean is
equal to the standard deviation divided by the square root of
(the sample size minus 1). Note that 1 is subtracted from N
before the square root is taken.
This is the preferred formula for practicing professionals, as it
accounts for errors that may be a function of the particular
sample we have selected.
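As a quick check on the formula, here is a minimal sketch in Python. The function name standard_error and the small data set are hypothetical. The comparison also rests on an assumption this module does not state: that s is the standard deviation computed with N in the denominator. Under that assumption, s / √(N – 1) gives the same number as the other common form, in which the standard deviation is computed with N – 1 in the denominator and then divided by √N.

```python
import math
import statistics


def standard_error(s, n):
    """Standard error of the mean as defined in this module: s / sqrt(N - 1)."""
    return s / math.sqrt(n - 1)


# Hypothetical raw scores, used only to illustrate the assumption above.
data = [9, 12, 15, 11, 14, 10, 13, 16]
n = len(data)

s_with_n = statistics.pstdev(data)         # SD with N in the denominator (assumed meaning of s)
s_with_n_minus_1 = statistics.stdev(data)  # SD with N - 1 in the denominator

se_module = standard_error(s_with_n, n)          # s / sqrt(N - 1)
se_alternative = s_with_n_minus_1 / math.sqrt(n)  # the other common form

print(round(se_module, 4), round(se_alternative, 4))  # the two agree

# With the statistics used in the examples of this module (s = 3.2, N = 140):
print(round(standard_error(3.2, 140), 2))
```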
THE CONFIDENCE INTERVAL (CI)
Which formula we use for the CI depends on the sample size (N).
For sample sizes ≥ 100, the formula for the CI is:
CI = (the sample mean) ± Z(s.e.)
Let’s look at an example to see how this formula works.
* Please use the PDF document “how to solve the problem,” which
I have provided for you under the “notes” link.
Example 1
Suppose that we conducted interviews with 140 randomly
selected individuals (N = 140) in a large metropolitan area. We
assured these individuals that their answers would remain
confidential, and we asked them about their law-breaking
behavior. Among other questions the individuals were asked to
self-report the number of times per month they exceeded the
speed limit. One of the objectives of the study was to estimate
(make an inference about) the average number of times per
month residents in all metropolitan areas across the country
exceeded the speed limit. The sample statistics we obtained
were as follows:
Mean = 12.4 times
S = 3.2 times
N = 140
Let’s construct a 95% CI around our estimate of the mean drawn
from this sample.
The sample mean of 12.4 times tells us that, on average, the
individuals from our sample exceed the speed limit about 12.4
times a month. This sample mean estimate is our best point
estimate of the true population mean. We know full well that
12.4 times is not the true population mean and that repeated
samples will yield different means. What does our sample mean
tell us about the mean of the entire population of metropolitan
residents? This is the question we are really trying to answer.
We want to make our point estimate of 12.4 more reliable and at
the same time, give ourselves the ability to make a probability
statement about the confidence we have in our estimate. To do
this, we use the CI equation above to construct a 95%
confidence interval around the sample mean estimate of 12.4.
We have all the information we need to fill in the formula
except for the Z score. The Z score for a 95% CI is 1.96, and we
can find it in the Z Table as follows. Remember that the total
area under the normal curve equals 100% and that half of that
area, 50%, lies on each side of the mean. To find the Z score
corresponding to 95%, we first divide 95% in half, which leaves
47.5% on each side of the mean inside the interval and 2.5% in
each tail outside it. Next we look inside the Z Table (the
numbers corresponding to areas under the normal curve) for the
number that comes closest to .4750 (47.5%) without going under
.4750 and identify the corresponding Z score. The correct Z
score is 1.96, where the area is .4750.
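The table lookup can be cross-checked in software. The sketch below uses NormalDist from Python's standard statistics module; the helper name z_for_confidence is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal curve: mean 0, standard deviation 1


def z_for_confidence(level):
    """Z score that leaves (1 - level) / 2 of the area in each tail."""
    tail = (1 - level) / 2
    return std_normal.inv_cdf(1 - tail)


z_95 = z_for_confidence(0.95)

# Area between the mean (Z = 0) and Z = 1.96, which is what the Z Table lists.
area_mean_to_z = std_normal.cdf(1.96) - 0.5

print(round(z_95, 2))            # Z score for a 95% CI
print(round(area_mean_to_z, 4))  # table area for Z = 1.96
```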
Now we can solve the equation.
95% CI = 12.4 ± 1.96 (3.2 / √(140 – 1))
= 12.4 ± 1.96 (3.2 / √139)
= 12.4 ± 1.96 (3.2 / 11.79)
= 12.4 ± 1.96 (.27)
= 12.4 ± .53
12.4 - .53 = 11.87
12.4 + .53 = 12.93
95% CI = 11.87 to 12.93
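The same arithmetic can be written as a short function. This is a minimal sketch; the function name z_confidence_interval is hypothetical, and it simply applies the CI formula for N ≥ 100 to the sample statistics from this example.

```python
import math


def z_confidence_interval(mean, s, n, z):
    """Return (lower, upper) for CI = mean ± Z(s.e.), with s.e. = s / sqrt(N - 1)."""
    se = s / math.sqrt(n - 1)
    margin = z * se
    return mean - margin, mean + margin


lower, upper = z_confidence_interval(mean=12.4, s=3.2, n=140, z=1.96)
print(f"95% CI = {lower:.2f} to {upper:.2f}")
```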
So what does this interval tell us? It tells us that, based on
our sample data, we can be 95 percent confident that the mean
number of self-admitted speeding violations among all residents
of metropolitan areas lies between 11.87 and 12.93 times per
month. That is, theoretically speaking, if we had taken a large
number of random samples from this same population and
calculated 95% confidence intervals around the means obtained
from each sample, approximately 95% of these intervals would
include the true population mean and 5 percent would not.
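This repeated-sampling interpretation can be illustrated with a simulation. The sketch below makes assumptions for the sake of the demonstration: it invents a normally distributed population whose true mean is 12.4 and whose standard deviation is 3.2, and it draws 2,000 samples of 140. None of these population values are known in the actual example.

```python
import math
import random
import statistics

random.seed(7)  # fixed seed so the sketch is repeatable

TRUE_MEAN = 12.4  # hypothetical population mean (assumed for the simulation)
TRUE_SD = 3.2     # hypothetical population standard deviation (assumed)
N = 140           # size of each sample
Z = 1.96          # Z score for a 95% CI
REPEATS = 2000    # number of repeated random samples

covered = 0
for _ in range(REPEATS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    mean = statistics.fmean(sample)
    s = statistics.pstdev(sample, mu=mean)  # SD with N in the denominator (assumed)
    margin = Z * s / math.sqrt(N - 1)       # Z(s.e.)
    if mean - margin <= TRUE_MEAN <= mean + margin:
        covered += 1

coverage = covered / REPEATS
print("share of intervals containing the true mean:", coverage)
```

The share printed should land close to .95, which is the sense in which we are "95 percent confident."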
Example 2
Let’s say for the sake of argument that we only wanted a 90%
CI about our sample mean, rather than a 95% CI for our point
estimate of 12.4. From the Z Table, we can find the correct Z
score corresponding to 90%. Remember that the total area
under the normal curve equals 100% and that half of that area,
50%, lies on each side of the mean. To find the Z score
corresponding to 90%, we first divide 90% in half, which leaves
45% on each side of the mean inside the interval and 5% in each
tail outside it. Next we look inside the Z Table (the numbers
corresponding to areas under the normal curve) for the number
that comes closest to .4500 (45%) without going under .4500 and
identify the corresponding Z score. The correct Z score is 1.65,
where the area is .4505. A Z score of 1.64 would be incorrect
because its area of .4495 is less than 45 percent, and thus our
CI estimate would not truly be a 90% confidence level estimate.
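The two candidate Z scores can be compared directly. This is a small sketch using NormalDist from Python's standard statistics module; it also prints the unrounded Z score that software reports for 90%, which falls between the two table entries.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal curve

# Area between the mean and each candidate Z score, as listed in a Z Table.
areas = {}
for z in (1.64, 1.65):
    areas[z] = std_normal.cdf(z) - 0.5
    print(f"Z = {z}: area = {areas[z]:.4f}")

# Unrounded Z score that leaves exactly 5% in each tail.
exact_z = std_normal.inv_cdf(0.95)
print(f"unrounded Z for 90%: {exact_z:.3f}")
```

Choosing 1.65 rather than 1.64 follows the rule stated above: take the first table entry whose area does not fall under .4500.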
As in Example 1, we will insert 1.65 into the CI equation and
solve.
90% CI = 12.4 ± 1.65 (3.2 / √(140 – 1))
= 12.4 ± 1.65 (3.2 / √139)
= 12.4 ± 1.65 (3.2 / 11.79)
= 12.4 ± 1.65 (.27)
= 12.4 ± .45
12.4 - .45 = 11.95
12.4 + .45 = 12.85
90% CI = 11.95 to 12.85
The interval indicates that we are 90 percent confident that the
true population mean speeding violation score falls between
11.95 and 12.85 times per month. Notice that the 90% confidence
interval is narrower than the 95% confidence interval. The
trade-off is that we are less confident (90 percent vs. 95
percent confident) that our true population mean falls into
this interval. By lowering our level of confidence, we gained
some precision in our estimate. We could reduce the width of
our confidence interval even more, but we would pay the price
in levels of confidence.
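The trade-off between confidence and precision can be seen by computing the margin, Z(s.e.), at several confidence levels for the same sample. In the sketch below, the Z scores for 90% and 95% are the ones used in this module. The 99% value of 2.58 is an addition for illustration, taken from a standard Z Table using the same "without going under" rule.

```python
import math

mean, s, n = 12.4, 3.2, 140
se = s / math.sqrt(n - 1)  # standard error of the mean

# Z scores by confidence level; 2.58 for 99% is not from this module.
z_scores = {"90%": 1.65, "95%": 1.96, "99%": 2.58}

margins = {}
for level, z in z_scores.items():
    margins[level] = z * se
    lower, upper = mean - margins[level], mean + margins[level]
    print(f"{level} CI: {lower:.2f} to {upper:.2f} (width {2 * margins[level]:.2f})")
```

Each step up in confidence widens the interval, because a larger Z score multiplies the same standard error.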
Example 3
Let’s say that we took a new sample, only this time we randomly
selected and interviewed 901 individuals, asking the same
questions. Our sample data for this sample are:
Sample mean = 12.4 times
S = 3.2 times
N = 901
Let’s construct a 90% CI around our estimate of the mean drawn
from this sample.
90% CI = 12.4 ± 1.65 (3.2 / √(901 – 1))
= 12.4 ± 1.65 (3.2 / √900)
= 12.4 ± 1.65 (3.2 / 30)
= 12.4 ± 1.65 (.11)
= 12.4 ± .18
12.4 - .18 = 12.22
12.4 + .18 = 12.58
90% CI = 12.22 to 12.58
The interval indicates that we are 90 percent confident that the
true population mean speeding violation score falls between
12.22 and 12.58 times per month. Notice that the interval is
considerably narrower than in Example 2, where the sample size
is 140. Why is this? By increasing the sample size, the s.e.
became smaller, because we divide by the square root of a larger
number. We can see this mathematically, but what is the
theoretical reasoning for this change? As our sample size
increased, we captured a greater proportion of the variability in
self-reported speeding violations that exists in the total
population. Consequently, our confidence interval estimate is
more precise. The lesson learned is that whenever you have a
choice between a smaller or a larger sample, choose the larger
sample, as your estimates (inferences) about the population will
be more precise.
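The effect of sample size on the standard error can be tabulated directly. In this sketch, the standard deviation of 3.2 is an assumption carried over from the earlier examples (it reproduces the ± .18 margin of this example), and the sample size of 401 is a hypothetical in-between value added for comparison.

```python
import math

S = 3.2      # standard deviation, assumed to match the earlier examples
Z_90 = 1.65  # Z score for a 90% CI

margins = {}
for n in (140, 401, 901):  # 401 is a hypothetical extra sample size
    se = S / math.sqrt(n - 1)
    margins[n] = Z_90 * se
    print(f"N = {n}: s.e. = {se:.3f}, margin = {margins[n]:.2f}")
```

Because N sits under a square root, cutting the margin in half takes roughly four times as many interviews.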
Example 4
We have been calculating the confidence interval for samples
where N ≥ 100. What if the sample size is less than 100 (N <
100)?
In this situation, we must use the two-tailed “T” distribution
from the Table of T Values, which I have provided to you as a
PDF document under the “notes” link. We use the two-tailed T
distribution because we are working with a confidence interval
and are concerned with the area between two points on either
side of the mean. This means that we will use the column
headings beneath the label “Level of Significance for Two-
Tailed Test.”
Let’s continue with our effort to estimate the number of self-
reported speeding violations and construct a confidence interval
using the T distribution.
Let’s say we are short on research funds and are only able to
randomly select and interview 17 individuals, and we want to
construct a 90% CI around our estimate of the population mean.
From our sample we obtained the following statistics:
Sample mean = 12.4 times
S = 3.2 times
N = 17
The formula we use is the same as that for samples where N ≥
100 except instead of using Z, we use T. The only trick is to
determine which value of T from the Table of T Values we will
use. The first task is to determine the correct column. For a
90% confidence level we will select the column labeled “.10”.
If we wanted a confidence level of 95% we would select the
column labeled “.05”. If we wanted a confidence level of 98%
we would select the column labeled “.02”. If we wanted a
confidence level of 99% we would select the column labeled
“.01”. These levels (.10, .05, .02, .01) represent the total area
remaining in the two tails of the curve that are outside of our
confidence interval. For example, when we construct a 90%
confidence interval 10% of the area under the curve lies outside
the confidence interval boundaries (100 – 90 = 10) and that
remaining 10% is split equally on either side of the boundary
such that 5% remains below the lower boundary of the
confidence interval and 5% remains above the upper boundary
of the confidence interval. The same logic holds true for any
given level of confidence when we are constructing a
confidence interval.
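The column logic can be written out in a few lines. This is a small sketch; the function name tail_areas is hypothetical.

```python
def tail_areas(confidence_percent):
    """Return (total area in both tails, area in each tail) as proportions."""
    total_percent = 100 - confidence_percent  # e.g. 100 - 90 = 10
    return total_percent / 100, total_percent / 2 / 100


for level in (90, 95, 98, 99):
    total, each = tail_areas(level)
    print(f"{level}% CI: use column {total:.2f} ({each:.3f} in each tail)")
```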
The second task is to select the correct row. To do this we must
calculate something called the degrees of freedom (abbreviated
df). The degrees of freedom (df) = N – 1. In this example, df =
17 – 1 = 16. Now we are able to find the appropriate value for
T to insert into our confidence interval formula. The degrees of
freedom are located in the very first column; they begin with 1,
go sequentially through 30, and then move to 40, 60, 120, and
infinity. Go down the column for df until you arrive at 16.
Go across the row for 16 until you are in the column for .10.
That number is 1.746. Now we are ready to construct our 90%
CI.
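The table value can be cross-checked without the printed table. The sketch below is not the lookup method described above: it swaps in a numerical approach, integrating the T distribution's density curve with Simpson's rule and searching for the T value whose central area matches the confidence level. The function names are hypothetical, and only Python's standard math module is used.

```python
import math


def t_density(x, df):
    """Height of the T distribution curve at x for the given degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def area_mean_to_t(t, df, steps=2000):
    """Area under the curve between 0 and t, by Simpson's rule (steps must be even)."""
    h = t / steps
    total = t_density(0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return total * h / 3


def t_critical(confidence, df):
    """T value with confidence / 2 of the area on each side of the mean (bisection)."""
    target = confidence / 2
    low, high = 0.0, 50.0
    for _ in range(60):
        mid = (low + high) / 2
        if area_mean_to_t(mid, df) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2


n = 17
df = n - 1  # degrees of freedom
t_90 = t_critical(0.90, df)
print(f"df = {df}, T for a 90% CI = {t_90:.3f}")
```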
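90% CI = 12.4 ± 1.746 (3.2 / √(17 – 1))
= 12.4 ± 1.746 (3.2 / √16)
= 12.4 ± 1.746 (3.2 / 4)
= 12.4 ± 1.746 (.8)
= 12.4 ± 1.397
12.4 - 1.397 = 11.003
12.4 + 1.397 = 13.797
90% CI = 11.003 to 13.797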
The interval indicates that we are 90 percent confident that the
true population mean speeding violation score falls between
11.003 and 13.797. Notice that the interval is considerably wider
than the intervals in any of the prior examples. This difference
is due to the same phenomenon I discussed in Example 3 above
regarding the effect of sample size on the precision of our
estimates of the true population mean.
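The small-sample calculation follows the same steps as the earlier examples, with T in place of Z. A minimal sketch, using the T value of 1.746 read from the Table of T Values for df = 16 in the .10 column:

```python
import math

mean, s, n = 12.4, 3.2, 17
t_value = 1.746  # from the Table of T Values: df = 16, column .10

se = s / math.sqrt(n - 1)  # 3.2 / sqrt(16) = 0.8
margin = t_value * se
lower, upper = mean - margin, mean + margin

print(f"90% CI = {lower:.3f} to {upper:.3f}")
```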