2. What is statistics?
• We collect data from the real
world to test our hypotheses about a
phenomenon.
• To test these hypotheses, we build
‘statistical models’, or scaled-down
versions of the phenomenon, and determine how well they fit
the situation of interest.
• In short, we want to know if the
statistical model is an accurate
representation of the real world or
of the data we collected or observed.
3. Concept of sampling
We cannot taste each and every
lanzones in the basket.
We get a sample, and hope that
it is representative of the
population.
We infer things about the general
population from the sample.
4. In statistics, we are interested in finding results through a ‘sample’ that can be generalized to the
entire ‘population’.
We get a small subset of the population, known as the SAMPLE, and use these data to infer
things about the population as a whole.
5. Inferential statistics
The process of drawing conclusions about the properties of a population
based on the information obtained from the sample.
The difficulty is determining which statistical model is the most
appropriate.
6. Slovin’s formula
n = N / (1 + N x e^2)
where:
n = no. of samples
N = total population
e = error margin / margin of error
There are 1000 employees in the organization. You want to conduct a satisfaction survey with
a margin of error of 0.05 (5%). Using Slovin’s formula, you need to survey:
n = 1000 / (1 + 1000 x 0.05^2) = 1000 / 3.5 = 285.7, or about 286 employees.
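The Slovin calculation can be sketched in a few lines of Python. The function name `slovin_sample_size` is made up for this illustration, and rounding up to the next whole respondent is an assumption (the slide does not say how to round).

```python
import math


def slovin_sample_size(population_size: int, margin_of_error: float) -> int:
    """Slovin's formula: n = N / (1 + N * e^2), rounded up (assumed rounding rule)."""
    n = population_size / (1 + population_size * margin_of_error ** 2)
    return math.ceil(n)


# The survey example: N = 1000 employees, e = 0.05 (5%)
sample_size = slovin_sample_size(1000, 0.05)
print(sample_size)
```

A tighter margin of error drives the required sample up quickly, which can be checked by calling the function with a smaller `margin_of_error`.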
9. Parameters & statistics
Parameters are characteristics of the “Population.”
μ is the population mean
σ is the population standard deviation
Statistics are characteristics of the “Sample.”
s is the sample standard deviation
11. • The sample mean tends to be close to the
population mean. The accuracy increases
as the sample size increases.
• If we take the mean of ALL POSSIBLE SAMPLE MEANS,
it will be equal to the POPULATION MEAN.
Standard error (SE) = standard deviation of the
SAMPLE MEANS around the POPULATION MEAN
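A tiny enumeration can make the "mean of all sample means" claim concrete. This is only a sketch: the five-score population is invented, and the samples are drawn without replacement (every pair of scores), which is an assumption about the sampling scheme.

```python
from itertools import combinations
from statistics import mean, pstdev

population = [2, 4, 6, 8, 10]  # made-up scores, not from the slides
sample_size = 2

# Every possible sample of 2 cases, and the mean of each one
sample_means = [mean(sample) for sample in combinations(population, sample_size)]

population_mean = mean(population)
mean_of_means = mean(sample_means)

# Standard error = standard deviation of the sample means around their mean
standard_error = pstdev(sample_means)

print(population_mean, mean_of_means, round(standard_error, 3))
```

The individual sample means range from 3 to 9, yet their average lands exactly on the population mean of 6.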
12. The sampling distribution of all
possible ‘sample means’ can
help generalize what the
‘population mean’ would likely
be.
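13. [Figure: normal curve of sample means (X̄) with standard deviation s. 68.26% of cases fall within 1 s of the mean, 95.44% within 2 s, and 99.74% within 3 s.]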
The Sampling Distribution of Means is
a NORMAL CURVE.
A normal curve is a
Probability Distribution,
which shows the likelihood of
cases as we travel away from
the mean of means (true
population mean).
It is hard... but wait.
14. Sampling distribution of means
As long as the sample size is reasonably large (n > 30), the sampling
distribution of means approximates a normal curve.
In short, if the samples are drawn from a reasonably large number of
cases, the SAMPLING DISTRIBUTION of means shall be approximately
normally distributed, regardless of how the raw scores are distributed.
The mean of the sampling distribution (the mean of means)
approaches the true population mean μ.
The standard deviation s of a sampling distribution of means is
smaller than the standard deviation of the population σ.
15. Two theories
• Central Limit Theorem: if you obtain repeated samples (n) from a
population (N) of whatever distribution, with “population” mean
(µ) and SD (σ), and the sample is large enough, the SAMPLING
DISTRIBUTION OF MEANS CAN BE ASSUMED TO BE NORMALLY DISTRIBUTED,
even if the population itself is not. If the population is normally
distributed, the sampling distribution of means is normal for any sample size.
• Law of Large Numbers: as the sample size (n) grows, the SAMPLE
MEAN gets closer and closer to the POPULATION MEAN (µ).
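The behaviour of the sampling distribution of means can be checked with a small simulation. This is a sketch under stated assumptions: the population is an exponential distribution with mean 1 and SD 1 (chosen only because it is clearly not normal), and the sample size, number of samples, and seed are arbitrary.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

random.seed(42)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 50    # cases per sample (assumed)
NUM_SAMPLES = 2000  # number of repeated samples (assumed)

# Skewed population: exponential with rate 1, so mean = 1 and SD = 1
sample_means = [
    fmean([random.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

mean_of_means = fmean(sample_means)
sd_of_means = pstdev(sample_means)
expected_se = 1.0 / sqrt(SAMPLE_SIZE)  # population SD / sqrt(n)

# Share of sample means within 1.96 standard errors of the population mean
share_within = sum(abs(m - 1.0) <= 1.96 * expected_se for m in sample_means) / NUM_SAMPLES

print(round(mean_of_means, 3), round(sd_of_means, 3), round(share_within, 3))
```

Even though the raw scores are strongly skewed, the sample means cluster around the population mean, their spread is much smaller than the population SD, and roughly 95% of them fall within 1.96 standard errors.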
16. Confidence interval
• Through the SE, we can find the Range of values within which the
POPULATION MEAN is likely to fall.
• We can use the SAMPLE MEAN as an estimate of the POPULATION MEAN
and find the Range within which there is either 99% or 95% probability
(chance) that the population mean will fall.
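[Figure: two normal curves centered on μ. About 95% of cases lie between Z = -1.96 and Z = +1.96, with 2.5% (outliers) in each tail. About 99% of cases lie between Z = -2.58 and Z = +2.58, with 0.5% (outliers) in each tail.]
99% CI = sample mean +/- (2.58 x SE)
95% CI = sample mean +/- (1.96 x SE)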
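The two interval formulas can be sketched as a small helper. The sample mean of 55 and the standard error of 1.5 are illustrative values only, and the function name is hypothetical.

```python
def confidence_interval(sample_mean: float, standard_error: float, z: float):
    """Return (lower, upper) = sample mean -/+ z * SE."""
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin


sample_mean = 55.0    # illustrative value
standard_error = 1.5  # illustrative value

ci_95 = confidence_interval(sample_mean, standard_error, 1.96)
ci_99 = confidence_interval(sample_mean, standard_error, 2.58)

print("95% CI:", ci_95)
print("99% CI:", ci_99)
```

The 99% interval is wider than the 95% interval: being more confident that the range contains the population mean costs precision.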
18. Sampling distribution
• This is the distribution of the different sample means,
plotted as a symmetrical curve.
• It tells us how representative a sample is
of the population.
• Hence, the Standard Error is the standard
deviation of the sample means.
• We obtain this by…
• If the SE is large, then there is a large
variability between the sample means
and so the sample might not represent
the population.
[Figure: sampling distribution of sample means centered on μ = 6.9]
19. Single sample test of the MEAN
Evaluate the performance of a given population against a ‘standard’
from the sample results.
Z-test for a Single Sample Test of the Mean (N>30)
T-test for a Single Sample Test of the Mean (N<30)
Z-test for Single Sample Test of the Proportion
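20. Confidence interval / Standard Error
[Figure: normal curve centered on μ. 95% of the area lies between Z = -1.96 and Z = +1.96 (47.5% on each side of the mean), with 2.5% in each tail.]
95% CI = sample mean +/- (1.96 x SE)
[Figure: normal curve centered on μ. 97% of the area lies between Z = -2.17 and Z = +2.17 (48.50% on each side of the mean), with 1.5% in each tail.]
97% CI = sample mean +/- (2.17 x SE)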
22. A random sample of 64 Local Government Units
(LGUs) was selected. A standard has been set that LGUs’
contractual personnel paid through their MOOE should not
be more than 60 percent. In your sample, the mean is 55
percent, and the standard deviation (σ) is 12. DO YOU THINK
THAT the LGUs are exceeding the standard percentage for
hiring contractual employees?
Ho : µ = 60
Ha : µ < 60
Z-test for single sample test of Mean (n > 30)
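The z statistic for the LGU example can be sketched as follows. The function name is hypothetical, and the standard error is assumed to be σ / sqrt(n), which reproduces the z = -3.33 shown on the next slide; the San Ignacio worksheet later in the deck divides by sqrt(n - 1) instead, which would give about -3.31 here.

```python
from math import sqrt


def z_single_mean(sample_mean: float, standard: float, sd: float, n: int) -> float:
    """Z = (sample mean - standard) / SE, with SE = sd / sqrt(n) (assumed form)."""
    standard_error = sd / sqrt(n)
    return (sample_mean - standard) / standard_error


# LGU example: n = 64, sample mean = 55, standard = 60, SD = 12
z = z_single_mean(sample_mean=55, standard=60, sd=12, n=64)

# Decision rule used in the slides: reject Ho when Z falls beyond +/- 1.96
reject_null = abs(z) > 1.96

print(round(z, 2), reject_null)
```

Because the sample mean sits more than three standard errors below 60, the null hypothesis is rejected in the direction of Ha (µ < 60).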
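23. Table A. Percentage of Area under the Normal Curve
z      Area between Mean & z   Area beyond z
3.30   49.95                   0.05
[Figure: normal curve centered on 0 (µ = 60). 95% of the area lies between the critical values of +/- 1.96, with 2.5% in each tail. The obtained z = -3.33 falls beyond -1.96, with 49.95% of the area between the mean and z and .05% beyond it.]
DECISION CRITERIA : With α = .05, we REJECT the Ho if Z falls beyond +/- 1.96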
25. You drew a sample of 1,500 informal settlers in Novaliches. PSA
claims that those in the bottom 30 percent of the population spend 59.7%
of their income on food. In your sample, the mean is 60.7 percent, and the
standard deviation (σ) is 12.4. DO YOU THINK THAT the informal settlers are
exceeding the standard expenses for food?
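26. Table A. Percentage of Area under the Normal Curve
z      Area between Mean & z   Area beyond z
3.12   49.91                   0.09
[Figure: normal curve centered on 0. The obtained z = 3.12 lies in the upper tail, with about 49.9% of the area between the mean and z and .09% (about .1%) beyond it. The critical values of +/- 1.96 leave 2.50% in each tail; the critical values of +/- 1.64 leave 5.00% in each tail.]
DECISION CRITERIA : With α = 5%, we REJECT the Null Hypothesis if Z falls beyond +/- 1.96
DECISION CRITERIA : With α = 10%, we REJECT the Ho if Z falls beyond +/- 1.64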
28. In an IQ test conducted in class, a sample of 24 students
was selected. It was found that the average IQ was 94. But
in the admission test, the standard IQ was not less than
100. The standard deviation (σ) of the sample is
12. Do you think that the IQ of the sample represents
that of the entire batch?
Ho : µ = 100
Ha : µ < 100
T-test for single sample test of Mean (n < 30)
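A sketch of the test statistic for the IQ example follows. It assumes the standard error is s / sqrt(n - 1), the form used in the San Ignacio worksheet later in the deck, because that form reproduces the 2.40 looked up on the next slide. The function name is hypothetical.

```python
from math import sqrt


def t_single_mean(sample_mean: float, standard: float, sd: float, n: int) -> float:
    """t = (sample mean - standard) / SE, with SE = sd / sqrt(n - 1) (assumed form)."""
    standard_error = sd / sqrt(n - 1)
    return (sample_mean - standard) / standard_error


# IQ example: n = 24, sample mean = 94, standard = 100, SD = 12
t = t_single_mean(sample_mean=94, standard=100, sd=12, n=24)

# The slides compare against +/- 1.96; a t table with n - 1 = 23 degrees of
# freedom gives a somewhat larger two-tailed cutoff of about 2.07.
reject_at_slide_cutoff = abs(t) > 1.96
reject_at_t_cutoff = abs(t) > 2.07

print(round(t, 2), reject_at_slide_cutoff, reject_at_t_cutoff)
```

In this example the decision is the same under either cutoff, but with small samples the t cutoff is the stricter of the two.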
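29. Table A. Percentage of Area under the Normal Curve
z      Area between Mean & z   Area beyond z
2.40   49.18                   0.82
[Figure: normal curve centered on 0. 95% of the area lies between the critical values of +/- 1.96, with 2.50% in each tail. The obtained z = -2.40 falls beyond -1.96, with 49.18% of the area between the mean and z and .82% beyond it.]
DECISION CRITERIA : With α = .05 or 5%, we REJECT the Ho if Z falls beyond +/- 1.96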
31. In measuring the Body Mass Index (BMI), you took a sample of 20
students. It was found that the average BMI was
23.7. However, the school’s health officials claim that the
standard is 18.1 (normal). The standard deviation (σ)
of the sample is 8. Do you think that the BMI of the
sample represents that of the entire student population?
Ho : µ = 18.1
Ha : µ > 18.1
T-test for single sample test of Mean (n < 30)
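32. Table A. Percentage of Area under the Normal Curve
z      Area between Mean & z   Area beyond z
3.05   49.89                   0.11
[Figure: normal curve centered on 0 (µ = 18.1). The obtained z = 3.05 lies in the upper tail, with 49.89% of the area between the mean and z and .11% beyond it. The critical values of +/- 1.96 leave 2.50% in each tail; the critical values of +/- 1.67 leave 5.00% in each tail.]
DECISION CRITERIA : With α = 5%, we REJECT the Null Hypothesis if Z falls beyond +/- 1.96
DECISION CRITERIA : With α = 10%, we REJECT the Ho if Z falls beyond +/- 1.67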
34. You took a poll of 2,000 voters in Manila and found
that one of the two candidates for office will obtain 54
percent of the votes sampled. DO YOU THINK HE WILL
WIN?
Ho : Pµ = .50
Ha : Pµ > .50
Z-test for single sample test of PROPORTION (n > 30)
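The proportion test can be sketched as below. The standard error is assumed to be sqrt(P(1 - P) / n) with P taken from the null hypothesis, and the function name is hypothetical. This gives z of about 3.58, close to the 3.60 looked up on the next slide; the gap appears to come from rounding the standard error.

```python
from math import sqrt


def z_single_proportion(sample_p: float, null_p: float, n: int) -> float:
    """Z = (p - P) / sqrt(P * (1 - P) / n), using the null proportion P (assumed form)."""
    standard_error = sqrt(null_p * (1 - null_p) / n)
    return (sample_p - null_p) / standard_error


# Poll example: n = 2,000 voters, sample proportion = .54, Ho: P = .50
z = z_single_proportion(sample_p=0.54, null_p=0.50, n=2000)

# Decision rule used in the slides: reject Ho when Z falls beyond +/- 1.96
reject_null = abs(z) > 1.96

print(round(z, 2), reject_null)
```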
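35. Ho : Pµ = .50
Ha : Pµ > .50
Table A. Percentage of Area under the Normal Curve
z      Area between Mean & z   Area beyond z
3.60   49.98                   0.02
[Figure: normal curve centered on 0. The obtained z = 3.60 lies in the upper tail, with 49.98% of the area between the mean and z and .02% beyond it. The critical values of +/- 1.96 leave 2.50% in each tail; the critical values of +/- 1.64 leave 5.00% in each tail.]
DECISION CRITERIA : With α = 5%, we REJECT the Null Hypothesis if Z falls beyond +/- 1.96
DECISION CRITERIA : With α = 10%, we REJECT the Ho if Z falls beyond +/- 1.64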
37. In the City of San Ignacio, the population mean of
income (μ) is P18,229. To know if this is true, we drew three
samples, with sample means of
P24,611, P23,667 & P23,056. Now we want to know the
‘probability’ or likelihood that our sample means
approximate the population mean. Your σ is P11,607.
Go to 101_Sampling & Population_workshop_Sheet_Income_2.
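38. n = 65
POPULATION MEAN = 18,229
POPULATION SD = 11,607
Z SCORE = (sample mean - population mean) / (population SD / sqrt(n - 1))
SET   Sample mean   Difference from population mean   Standard error                  Z SCORE
A     24,611        24,611 - 18,229 = 6,382           11,607 / sqrt(65 - 1) = 1,451   4.40
B     23,667        23,667 - 18,229 = 5,438           11,607 / sqrt(65 - 1) = 1,451   3.75
C     23,056        23,056 - 18,229 = 4,827           11,607 / sqrt(65 - 1) = 1,451   3.33
Table A. Percentage of Area under the Normal Curve
(a) Z score   (b) Area between mean & Z   (c) Area beyond Z
4.40          49.977                      0.003
3.75          49.990                      0.010
3.33          49.950                      0.05
[Figure: normal curve centered on the population mean of P18,229. 49.95% of the area lies between the mean and Z = 3.33, with 0.05% beyond it; Z = 3.75 and Z = 4.40 lie even farther out in the tail. The critical value of +/- 1.96 leaves 2.50% in each tail.]
Define the decision criteria if the margin
of error is 5%, 1% and 10%.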
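The worksheet arithmetic can be reproduced with a short script. It uses the figures in the worksheet table (n = 65, population mean 18,229, population SD 11,607) and the worksheet's own standard error, SD / sqrt(n - 1); the variable names are made up for this sketch.

```python
from math import sqrt

N = 65
POPULATION_MEAN = 18_229
POPULATION_SD = 11_607

# Standard error as computed in the worksheet: 11,607 / sqrt(65 - 1)
standard_error = POPULATION_SD / sqrt(N - 1)

sample_means = [24_611, 23_667, 23_056]  # the three sample sets, in slide order

z_scores = [(m - POPULATION_MEAN) / standard_error for m in sample_means]

for sample_mean, z in zip(sample_means, z_scores):
    # Decision rule used in the slides for a 5% level: beyond +/- 1.96
    print(sample_mean, round(z, 2), abs(z) > 1.96)
```

All three sample means lie more than three standard errors above the population mean, so each one falls in the rejection region at the 5% level.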
39. Rule of thumb
The sample mean tends to be close to the population mean. The
accuracy increases as the sample size increases.
(1) If the DIFFERENCE BETWEEN MEANS lies far from 0 (the mean of
differences), it has a small probability of occurring by chance, thus we REJECT the
NULL HYPOTHESIS and ACCEPT the ALTERNATIVE HYPOTHESIS. The difference is
too statistically significant to ignore.
(2) If it is close to 0, the probability is large, ergo we RETAIN the
NULL HYPOTHESIS. Why? The difference of means might be the result
of a sampling error.
40. Level of significance
When do we reject someone after
giving him so many chances to prove his
worth?
Oftentimes, we reject a loved one when the
person invades or crosses something
critically significant to us.
This is the concept of the Level of
Significance, denoted as α = .05 or 5%.
α = .05 is the level of probability at which
the Null Hypothesis can be rejected
(when the line is crossed), and the
ALTERNATIVE HYPOTHESIS can be
accepted.
41. Null vs Alternative Hypothesis
Null hypothesis (H0) means that (μA = μB): “THERE IS NO
DIFFERENCE BETWEEN THE POPULATION MEAN & SAMPLE MEAN. If
there are any differences, discrepancies, or suspiciously outlying
results, they are purely due to sampling error.”
Alternative hypothesis (Ha) means that (μA ≠ μB): “THERE IS A BIG
DIFFERENCE,” or the difference between the population and the sample is
too large to ignore and is statistically significant.
42. If the Probability (P) < .05, we reject the NULL since the
probability is too small (less than 5 chances out of 100)
that the difference is a result of sampling error.
P value < .05, WE REJECT
In the same way, a .05 level of significance is associated
with z score = 1.96 in either tail of the normal curve. In
other words, 95% of the differences between means fall between
-1.96 σ(X̄1 - X̄2) and +1.96 σ(X̄1 - X̄2), where σ(X̄1 - X̄2) is the
standard error of the difference between means. Only 5% fall outside these
cut-offs. These shaded regions are the CRITICAL OR
REJECTION REGIONS.
This tells us that 95% (47.50% + 47.50%) of the differences between
sample means lie within 1.96 standard deviations of the mean.
Only 5% fall at or beyond this point
(2.5% + 2.5% = 5%), at both tail ends.
[Figure: normal curve centered on 0. 95% of the area lies between Z = -1.96 and Z = +1.96 (47.5% on each side of the mean), with 2.50% in each tail.]
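The link between a z score and the area in the tails can be checked with Python's standard library. This is a sketch: `area_beyond` is a made-up helper name, and it reports the two-tailed area as a proportion rather than the percentages used in Table A.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def area_beyond(z: float) -> float:
    """Two-tailed area (as a proportion) beyond -|z| and +|z|."""
    return 2 * (1 - standard_normal.cdf(abs(z)))


alpha = 0.05

# Critical z that leaves alpha / 2 in each tail
critical_z = standard_normal.inv_cdf(1 - alpha / 2)

p_at_cutoff = area_beyond(1.96)  # area in both rejection regions at z = 1.96
p_at_212 = area_beyond(2.12)     # a z score inside the rejection region

print(round(critical_z, 2), round(p_at_cutoff, 3), p_at_212 < alpha)
```

Any z score farther from 0 than the critical value has a two-tailed probability below α, which is the same decision rule stated in terms of P.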
43. We can even set a more conservative or stringent level of
significance, whereby we reject the NULL HYPOTHESIS if
the probability is less than 1 chance out of 100. This is α = .01, which
is very conservative. The z = 2.58.
TABLE A. Percentage of Area Under the Normal Curve
(a) z score   (b) Area between Mean and Z   (c) Area beyond z
1.96          47.50                         2.50
2.58          49.51                         0.49
[Figure: normal curve centered on 0. Z = +/- 1.96 encloses 95% of the area (47.5% on each side of the mean), leaving 2.50% in each tail. Z = +/- 2.58 encloses about 99% (49.5% on each side), leaving 0.5% in each tail.]
44. Types of errors
By the way, whenever we make a decision about the NULL HYPOTHESIS, we open ourselves to two
kinds of errors.
Type 1 error is when we reject the null, and it is actually true. If our level of significance
(α) is .01, there is 1 chance out of 100 of making the wrong decision.
Type 2 error is when we retain the null, and it is actually false. To avoid this, we
increase the sample size so that the population is better represented.
Again let us go back to the level of significance (α = .05). The P (probability) is
computed from the actual cases drawn from the data. As researchers, we set the level of
significance as a threshold below which the NULL HYPOTHESIS is rejected, since
the probability is so small.
The alpha value α is the size of the tail region under the curve that makes us reject
the Null Hypothesis.
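A simulation can illustrate what α means as a Type 1 error rate. This is a sketch under stated assumptions: the null hypothesis is true by construction, the population is normal with an invented mean of 100 and SD of 15, the SD is treated as known so the standard error is σ / sqrt(n), and the seed and number of trials are arbitrary.

```python
import random
from math import sqrt
from statistics import fmean

random.seed(7)  # fixed seed so the run is repeatable

MU, SIGMA = 100, 15      # invented population mean and SD
N, TRIALS = 36, 4000     # sample size and number of repeated studies (assumed)
standard_error = SIGMA / sqrt(N)

false_rejections = 0
for _ in range(TRIALS):
    # The null hypothesis is true here: every sample comes from the population itself
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    z = (fmean(sample) - MU) / standard_error
    if abs(z) > 1.96:  # rejection rule for alpha = .05
        false_rejections += 1

type_1_rate = false_rejections / TRIALS
print(round(type_1_rate, 3))
```

Roughly 5 out of every 100 simulated studies reject a null hypothesis that is actually true, which is the risk accepted by choosing α = .05.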
45. In this case, although the area beyond the obtained z is
1.70% or .017, we chose 5% or .05 as the level of significance. Thus, if the P <
.05, we reject the Null Hypothesis. We may commit
a Type I error.
TABLE A. Percentage of Area Under the Normal Curve
(a) z score   (b) Area between Mean and Z   (c) Area beyond z
1.96          47.50                         2.50
2.12          48.30                         1.70
[Figure: normal curve centered on 0. Z = +/- 1.96 encloses 95.0% of the area (47.5% on each side of the mean), leaving 2.5% in each tail. The obtained Z = +/- 2.12 lies beyond the cut-off, with 48.3% of the area between the mean and Z and 1.7% beyond it.]