2. Statistical inference: Making
guesses about the population from a sample
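Truth (not observable)
Population parameters:
$\mu = \dfrac{1}{N}\sum_{i=1}^{N} x_i$
$\sigma^2 = \dfrac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$
Sample (Observation)
Sample statistics:
$\bar{X} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$
$s^2 = \dfrac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{X})^2$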
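A minimal sketch of the difference between the population formulas and the sample formulas, using Python's standard `statistics` module. The eight "population" values and the four-value sample are made-up illustrative data, not taken from the slides.

```python
import statistics

# Hypothetical tiny "population" of N = 8 values (illustrative only).
population = [150, 152, 155, 158, 160, 163, 165, 170]

# Population parameters: both divide by N.
mu = statistics.fmean(population)
sigma2 = statistics.pvariance(population)

# A hypothetical sample of n = 4 values drawn from that population.
sample = [152, 158, 163, 170]

# Sample statistics: the variance divides by n - 1, not n.
x_bar = statistics.fmean(sample)
s2 = statistics.variance(sample)

print(f"population: mu = {mu}, sigma^2 = {sigma2:.2f}")
print(f"sample:     x_bar = {x_bar}, s^2 = {s2:.2f}")
```

In practice only the second pair of numbers is available; the first pair is the unobservable truth that the sample is used to estimate.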
3. Statistics vs. Parameters
› Sample statistic – any summary measure
calculated from data; e.g., could be a mean, a
difference in means or proportions, an odds ratio,
or a correlation coefficient
› Population parameter – the true value/true effect
in the entire population of interest
4. Examples of Sample Statistics:
› Mean
› Rate
› Risk
› Difference in means
› Relative risk (odds ratio/ risk ratio…)
› Correlation coefficient
› Regression coefficient
…
5. › A single number calculated from our sample data
› How can a single number (e.g., a mean) have a
distribution?
– Answer: It’s a theoretical concept!
Statistics follow distributions!
– Sampling distribution
7. › The sampling distributions are defined by:
• Shape (e.g., normal distribution, t-distribution)
• Mean
• Standard error
8. The Central Limit Theorem
If all possible random samples, each of size n, are
taken from any population with a mean $\mu$ and a
standard deviation $\sigma$, the sampling distribution of
the sample means will:
1. Have mean: $\mu_{\bar{x}} = \mu$
2. Have standard deviation (standard error): $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$
3. Be approximately normally distributed
regardless of the shape of the parent
population (normality improves with larger n)
$\mu_{\bar{x}}$: the mean of the sample means
$\sigma_{\bar{x}}$: the standard deviation of the sample means, also called
“the standard error”; written variously as $\sigma_{\bar{x}}$, $SD_{\bar{x}}$, $SE_{\bar{x}}$, SEM, or SE
9. $\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$
For n > 1, the SEM is always smaller than the
SD of the population
As n increases, variation decreases
Finally, if n is large enough, the sampling
distribution of the mean is approximately
normal!
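The shrinking of the standard error with n can be checked by simulation. The sketch below is an illustration under assumptions that are not in the slides: the parent population is exponential with rate 1 (strongly skewed, with mean 1 and SD 1), and 4000 repeated samples stand in for "all possible samples".

```python
import random
import statistics
from math import sqrt

random.seed(1)  # fixed seed so the run is repeatable

SIGMA = 1.0       # SD of an exponential(rate=1) population (its mean is also 1)
N_SAMPLES = 4000  # illustrative number of repeated samples per sample size


def empirical_sem(n):
    """SD of the means of N_SAMPLES random samples, each of size n."""
    means = [
        statistics.fmean([random.expovariate(1.0) for _ in range(n)])
        for _ in range(N_SAMPLES)
    ]
    return statistics.stdev(means)


# Map each sample size to (simulated SEM, theoretical sigma / sqrt(n)).
results = {n: (empirical_sem(n), SIGMA / sqrt(n)) for n in (4, 25, 100)}

for n, (observed, theoretical) in results.items():
    print(f"n={n:3d}  simulated SEM={observed:.3f}  sigma/sqrt(n)={theoretical:.3f}")
```

The simulated values should track $\sigma/\sqrt{n}$ closely and fall as n grows, even though the parent population is far from normal.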
11. Applications Using the Sampling
Distribution of the Mean
› Apply tables of the standard
normal distribution
Serum cholesterol levels for all 20- to 74-year-old males
in the US have:
$\mu$ = 211 mg/dL
$\sigma$ = 46 mg/dL
If we select repeated samples of size 25, what
proportion of the samples will have a mean
value of 230 mg/dL or above?
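For a single observation: $z = \dfrac{X - \mu}{\sigma}$
For a sample mean: $z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$
$\mu_{\bar{x}} = \mu = 211$
$\sigma_{\bar{x}} = \dfrac{46}{\sqrt{25}} = 9.2$
$z = \dfrac{230 - 211}{9.2} = 2.07$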
12. $P(Z < 2.07) = 0.9808$
$P(Z \ge 2.07) = 1 - 0.9808 = 0.0192$
About 1.9% of samples will
have a mean ≥ 230 mg/dL
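The table lookup for the cholesterol example can be reproduced with `statistics.NormalDist` from the Python standard library. This is a sketch of the same calculation, not a replacement for the table method; the variable names are illustrative. Because it does not round z to two decimals first, the tail probability may differ from the table value in the third or fourth decimal place.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 211.0, 46.0, 25  # population mean, population SD, sample size
threshold = 230.0               # sample-mean cut-off in mg/dL

se = sigma / sqrt(n)            # standard error of the mean
z = (threshold - mu) / se       # z-score of the cut-off

# Upper-tail area of the standard normal distribution beyond z.
p_upper = 1 - NormalDist().cdf(z)

print(f"SE = {se:.1f} mg/dL, z = {z:.2f}, P(mean >= 230) = {p_upper:.4f}")
```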
13. What are the upper and lower limits that enclose 95% of the means
of samples of size 25 drawn from the population?
$P(-1.96 \le Z \le 1.96) = 0.95$
$-1.96 \le \dfrac{\bar{X} - 211}{9.2} \le 1.96$
$193.0 \le \bar{X} \le 229.0$
About 95% of the means of
samples of size 25 lie between
193.0 mg/dL and 229.0 mg/dL
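The same limits can be obtained from the inverse of the standard normal CDF. This is a minimal sketch using `statistics.NormalDist.inv_cdf`; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 211.0, 46.0, 25
se = sigma / sqrt(n)  # 9.2 mg/dL

# z-value leaving 2.5% in each tail (about 1.96).
z95 = NormalDist().inv_cdf(0.975)

lower = mu - z95 * se
upper = mu + z95 * se

print(f"95% of sample means lie between {lower:.1f} and {upper:.1f} mg/dL")
```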
14. How large would the samples need to be for 95% of
their means to lie within 5 mg/dL of the population
mean?
$\mu$ = 211 mg/dL
$\sigma$ = 46 mg/dL
$P(\mu - 5 \le \bar{X} \le \mu + 5) = 0.95$
$1.96 \cdot SE = 5$
$1.96 \cdot \dfrac{\sigma}{\sqrt{n}} = 5$
$n = \left(\dfrac{1.96 \times 46}{5}\right)^2 \approx 325.2$, rounded up to 326
Samples of size 326 would be
required for 95% of the sample
means to lie within 5 mg/dL of
the population mean
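The sample-size calculation generalizes to any SD, margin, and coverage level. The helper below is a hypothetical illustration (the name `required_n` and its parameters are not from the slides); it solves $z \cdot \sigma/\sqrt{n} = \text{margin}$ for n and rounds up, since a fractional sample size is not possible.

```python
from math import ceil
from statistics import NormalDist


def required_n(sigma, margin, coverage=0.95):
    """Smallest n for which `coverage` of sample means fall within
    `margin` of the population mean, assuming a normal sampling
    distribution with standard error sigma / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2)  # about 1.96 for 95%
    return ceil((z * sigma / margin) ** 2)


n_needed = required_n(sigma=46.0, margin=5.0)
print(n_needed)
```

Halving the margin roughly quadruples the required sample size, because n depends on the square of $\sigma$ divided by the margin.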
15. Practice
Q1. A laboratory value with a mean of 18 g/dL and a standard
deviation of 1.5 implies
a. The true value is between 16.5 and 19.5 g/dL
b. The true value is between 15.0 and 21.0 g/dL
c. The error is too large for the determination to have any value
d. In repeated determinations on the same sample, 95% could be
expected to fall between 15.0 and 21.0 g/dL
e. The true value has a 5% chance of being less than 16.5 or more
than 19.5 g/dL
16. Q2. Data for patients at a certain hospital show the mean length of
stay is 10 days and the median is 8 days. The most frequent length of
stay is 6 days. From these facts, we conclude
a. Approximately 50% of the patients stay less than 6 days
b. The distribution of length of stay follows the normal curve
c. The standard deviation is 2 days
d. The mean length of stay is shifted away from the center of the
distribution by stays of very long duration
e. The mean length of stay is shifted away from the center of the
distribution by stays of very short duration
17. Q3. A random sample of teenage prenatal patients seen at University
Hospital during 1973 had a mean hematocrit of 29 with a standard
error of 1.5. From this information, we may conclude
a. The hematocrit for any teenage prenatal patient in the sample
will not deviate from the mean by any more than 50%
b. The normal range for teenage prenatal patients seen at
University Hospital is 26 to 32
c. The range of 26 to 32 will include the mean of all teenage
prenatal patients seen at University Hospital in 1973 with 95%
probability
d. It is to be expected that 95% of all teenage prenatal patients
seen at University Hospital in 1973 will have hematocrits in
the range of 26 to 32
18. Q4. The IQs of a class of students attending a university are
distributed according to the normal curve, with a mean of
115 and a standard deviation of 10. Therefore
a. 50% have IQs between 105 and 115
b. 95% have IQs between 105 and 115
c. 2.5% have IQs above 135
d. 5% have IQs above 135
e. 5% have IQs below 95
19. Q5. The primary use of the standard error of the mean is in
calculating the:
a. Confidence interval
b. Error rate
c. Standard deviation
d. Variance
Editor's Notes
Statistical inference is all about making guesses about a population from a sample.
There is some large population, and there is something we want to know about it, such as the effectiveness of a vaccine or the mean height of Japanese adult females (these are called population parameters). But we cannot measure everybody, so the truth is not observable.
What we can do instead is take a subset, a small representative subset of the larger population, which we call a sample.
We can observe the sample and calculate all the measures in the sample: the mean height, the proportion of vaccinated children in the sample. We call those numbers calculated from our data sample statistics.
Then we use those numbers to make guesses about the population the sample came from.
On the other hand, a population parameter is the true value or the true effect in the entire population of interest, if you could measure it. Of course, usually you can’t actually measure it.
There are lots of examples of statistics that we calculate from our data. You can calculate a mean, you can calculate a rate…
More simply, it is the distribution of a statistic across an infinite number of samples.
Suppose that in a specified population, the mean of, let’s say, a test result is $\mu$ = 82.5 and the SD is $\sigma$. We randomly select a sample of 20 observations from the population and compute the mean of this sample; call the sample mean $\bar{x}_1$. We then obtain a second random sample of 20 observations and calculate the mean of this new sample; call it $\bar{x}_2$. If we were to continue this procedure indefinitely, we would end up with a set of means.
If we treat each mean as an observation, their collective probability distribution is known as the sampling distribution of means of samples of size 20.
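The repeated-sampling procedure described in these notes can be imitated on a computer. The sketch below makes assumptions that the notes do not state: the population is taken to be normal, its SD is set to a hypothetical value of 10, and 5000 repetitions stand in for "indefinitely".

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

MU = 82.5          # population mean from the notes
SIGMA = 10.0       # hypothetical population SD (the notes leave it unspecified)
SAMPLE_SIZE = 20   # observations per sample, as in the notes
N_SAMPLES = 5000   # illustrative number of repeated samples


def one_sample_mean():
    """Draw one random sample of SAMPLE_SIZE observations and return its mean."""
    return statistics.fmean([random.gauss(MU, SIGMA) for _ in range(SAMPLE_SIZE)])


# The collection of means approximates the sampling distribution of the mean.
means = [one_sample_mean() for _ in range(N_SAMPLES)]

mean_of_means = statistics.fmean(means)
sd_of_means = statistics.stdev(means)

print(f"mean of sample means = {mean_of_means:.2f}")
print(f"SD of sample means   = {sd_of_means:.2f}")
```

The mean of the simulated means should sit close to 82.5, and their SD close to $10/\sqrt{20} \approx 2.24$, which is the standard error under these assumed values.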
What are the 2 parameters (from last time) that define any normal distribution?
Remember that a normal curve is characterized by two parameters, a mean and a variability (SD)
Remember that the standard deviation is the natural variability of the population. The standard deviation of a statistic is called a standard error (SE).
Standard error can be standard error of the mean or standard error of the odds ratio or standard error of the difference of 2 means, etc. The standard error of any sample statistic.
I would like to talk about perhaps one of the most fundamental and profound concepts in statistics: the Central Limit Theorem.
The mean of the means is equal to the population mean.
Even if the population is skewed or even bimodal, a sample size of 30 is often sufficient.
A sample of 30 is commonly used as a cutoff value because sampling distributions of the mean based on sample sizes of 30 or more are considered to be normally distributed. A sample this large is not always needed, however. If the parent population is normally distributed, the means of samples of any size will be normally distributed.
Who remembers the first step that Liu-san guided us through last week?
That is to calculate the z-score.
The z-score transforms a normally distributed variable with mean $\mu$ and SD $\sigma$ to the standard normal distribution with mean = 0 and SD = 1.
According to the central limit theorem, the mean of the sampling distribution is still $\mu$, but the SD is replaced by the SEM, so the z-score is calculated as $z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$.
So, assuming that a sample size of 25 is large enough, the central limit theorem states that the distribution of means of samples of size 25 is approximately normal with mean = 211 mg/dL and SE = $46/\sqrt{25}$ = 9.2 mg/dL.
We already know that 2.5% of the area lies above z = 1.96 and another 2.5% lies below z = −1.96.
Recall that 95% of the area lies within mean ± 2 SD (SE in this case); more precisely, ± 1.96.
Distribution of a proportion:
Shape: normal distribution if np > 5
Mean = true proportion in the population
SE = $\sqrt{p(1-p)/n}$
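A short sketch of the standard error of a proportion. The function name `se_proportion` and the values p = 0.30 and n = 100 are hypothetical, chosen only to illustrate the formula and the np > 5 rule of thumb from these notes.

```python
from math import sqrt


def se_proportion(p, n):
    """Standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    if not 0 <= p <= 1 or n <= 0:
        raise ValueError("need 0 <= p <= 1 and n > 0")
    return sqrt(p * (1 - p) / n)


p_true, n = 0.30, 100        # hypothetical true proportion and sample size
normal_ok = n * p_true > 5   # rule of thumb from the notes: np > 5
se = se_proportion(p_true, n)

print(f"np > 5: {normal_ok}, SE = {se:.4f}")
```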