1. Topic: Chapter 9 - Sampling and Statistical Inference
SUBJECT - Research Methodology in Civil Engineering - CE541
FACULTY GUIDE: Prof. Amit A. Amin
PREPARED BY:
Bhavik A. Shah (17TS809)
CIVIL ENGG. DEPARTMENT
BIRLA VISHVAKARMA MAHAVIDYALAYA ENGG. COLLEGE
VALLABH VIDYANAGAR-388120
M.TECH - TRANSPORTATION ENGINEERING
2. Table of Contents
Introduction
Parameter and Statistic
Sampling and Non-Sampling Errors
Sampling Distribution
Degrees of Freedom
Standard Error
Central Limit Theorem
Finite Population Correction
Statistical Inference
3. Introduction
A population is the collection of all the elements of interest.
A sample is a subset of the population.
Sampling may be defined as the selection of some part of an aggregate or
totality on the basis of which a judgement or inference about the
aggregate or totality is made. In other words, it is the process of obtaining
information about an entire population by examining only a part of it.
4. Why sample?
Time: sampling saves the time of the researcher and of those being surveyed.
Cost: sampling lowers the cost to the group or agency commissioning the survey.
Confidentiality, anonymity, and other ethical issues.
Non-interference with the population: a large sample could alter the nature of the
population, e.g. in opinion surveys.
Non-destruction of the population, e.g. crash testing only a small sample of automobiles.
Cooperation of respondents: individuals, firms, administrative agencies.
Availability: partial data may be all that is available, e.g. fossils, historical records, climate change data.
5. NEED FOR SAMPLING
Sampling can save time and money. A sample study is usually less expensive than a
census study and produces results faster.
Sampling may enable more accurate measurements, because a sample study is generally
conducted by trained and experienced investigators.
Sampling is the only option when the population contains infinitely many members.
Sampling is the only option when a test involves the destruction of the item
under study.
Sampling usually makes it possible to estimate the sampling error and thus assists in
obtaining information about some characteristic of the population.
6. Parameter and Statistic
A statistic is a characteristic of a sample, whereas a parameter is a characteristic of
a population. Measures such as the mean, median or mode are called statistics when
they are computed from a sample, because they describe the characteristics of that
sample. When the same measures describe the
characteristics of a population, they are known as parameters.
For instance, the population mean (μ) is a parameter, whereas the sample mean (x̄) is a
statistic. Estimating a parameter from a statistic is the
prime objective of sampling analysis.
Parameter = Statistic ± Its Error
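As a rough illustration of the difference between a parameter and a statistic, the sketch below builds a simulated population, computes its mean (the parameter), and compares it with the mean of one random sample (the statistic). The population values, the sample size of 100 and all variable names are assumptions made for this example only.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated values (not real survey data).
population = [random.gauss(50, 10) for _ in range(10_000)]

# Parameter: computed from every element of the population.
population_mean = statistics.mean(population)

# Statistic: computed from a simple random sample drawn without replacement.
sample = random.sample(population, 100)
sample_mean = statistics.mean(sample)

# The gap between the two is the error in "Parameter = Statistic ± Its Error".
estimation_error = population_mean - sample_mean

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
print(f"error:                       {estimation_error:+.2f}")
```

In practice the parameter is unknown, so the error can only be bounded, which is what the standard error and confidence intervals later in the chapter are for.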
8. Sampling and Non-Sampling Errors
Sampling error refers to differences between the sample and the population that
exist only because of the observations that happened to be selected for the sample.
Increasing the sample size will reduce this type of error.
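The claim that a larger sample reduces sampling error can be checked with a small simulation. This is only a sketch: the uniform population, the three sample sizes and the helper name `average_sampling_error` are illustrative assumptions, not part of the original slides.

```python
import random
import statistics

random.seed(0)

# Hypothetical population with a known mean (simulated values).
population = [random.uniform(0, 100) for _ in range(20_000)]
true_mean = statistics.mean(population)

def average_sampling_error(sample_size, repetitions=200):
    """Mean absolute gap between sample mean and population mean."""
    errors = []
    for _ in range(repetitions):
        sample = random.sample(population, sample_size)
        errors.append(abs(statistics.mean(sample) - true_mean))
    return statistics.mean(errors)

error_by_size = {n: average_sampling_error(n) for n in (10, 100, 1000)}
for n, err in error_by_size.items():
    print(f"n = {n:>4}: average sampling error = {err:.2f}")
```

Non-sampling errors would not shrink in this way, because they do not depend on which units happen to be selected.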
10. Sampling Errors
Errors caused by the act of taking a sample rather than a census.
They cause sample results to differ from the results of a census.
They are differences between the sample and the population that exist only
because of the observations that happened to be selected for the sample.
Statistical errors are sampling errors.
For a given sample, we have no control over them.
11. Non-Sampling Errors
Not controlled by sample size
Non-response error
Response error
12. Non-Response Error
A non-response error occurs when units selected as part of
the sampling procedure do not respond in whole or in part.
13. Response Errors
A response or data error is any systematic bias that occurs during data
collection, analysis or interpretation.
Respondent error (e.g., lying, forgetting, etc.)
Interviewer bias
Recording errors
Poorly designed questionnaires
Measurement error
14. Respondent error
The respondent gives an incorrect answer, e.g. due to prestige or competence
implications, or due to the sensitivity or social undesirability of the question
The respondent misunderstands the requirements
The respondent lacks motivation to give an accurate answer
A “lazy” respondent gives an “average” answer
The question requires memory/recall
Proxy respondents are used, i.e. answers are taken from someone other than
the respondent
15. Interviewer bias
Different interviewers administer a survey in different ways
Differences occur in reactions of respondents to different interviewers, e.g.
to interviewers of their own sex or own ethnic group
Inadequate training of interviewers
Inadequate attention to the selection of interviewers
There is too high a workload for the interviewer
16. Measurement Error
The question is unclear, ambiguous or difficult to answer
The list of possible answers suggested in the recording instrument is
incomplete
Requested information assumes a framework unfamiliar to the respondent
The definitions used by the survey are different from those used by the
respondent (e.g. how many part-time employees do you have? See next
slide for an example)
17. Key Points on Errors
Non-sampling errors are inevitable in the production of national statistics.
It is important that:
At the planning stage, all potential non-sampling errors are listed and steps
to minimise them are considered.
If data are collected from other sources, the procedures adopted for data
collection and for data verification at each step of the data chain are questioned.
The data collected are viewed critically and queries are resolved as soon as
they arise.
Sources of non-sampling errors are documented so that the results presented can be
interpreted meaningfully.
18. Sampling Distributions
Sampling Distribution of Mean
Student’s ‘t’ Distribution
Sampling Distribution of Proportion
F Distribution
Chi-square Distribution
19. Sampling distribution of mean
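The mean calculated from a sample is usually the best guess for the population mean. But
different samples give different sample means (x̄1, x̄2, x̄3, ...)!
It can be shown that sample means from samples of size n are normally distributed
(exactly if the population is normal, approximately if n is large):
$$\bar{x} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$
Here μ is the population mean and σ is the population standard deviation.
The term σ/√n is called the standard error (the standard deviation of sample means).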
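The standard-error formula σ/√n can be checked numerically. The sketch below draws many samples from a normal population and compares the spread of their means with σ/√n. The values of `mu`, `sigma`, `n` and `repetitions` are illustrative assumptions.

```python
import math
import random
import statistics

random.seed(1)

# Assumed population parameters for the simulation (illustrative values).
mu, sigma = 100.0, 15.0
n = 25              # size of each sample
repetitions = 2000  # number of samples drawn

# Draw many samples and record the mean of each one.
sample_means = [
    statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(repetitions)
]

observed_se = statistics.stdev(sample_means)   # spread of the sample means
theoretical_se = sigma / math.sqrt(n)          # sigma / sqrt(n)

print(f"observed standard error:    {observed_se:.2f}")
print(f"theoretical standard error: {theoretical_se:.2f}")
```

The two numbers should agree closely, and the agreement tightens as `repetitions` grows.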
20. CONT…
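The sample mean comes from the normal distribution
$$\bar{x} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$$
Knowing the properties of the normal distribution, we can be 95% sure that the sample mean is in
the range:
$$\mu - 1.96\,\frac{\sigma}{\sqrt{n}} \le \bar{x} \le \mu + 1.96\,\frac{\sigma}{\sqrt{n}}$$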
21. CONT…
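If the population standard deviation is unknown, it can be shown that the
standardised sample mean (x̄ - μ)/(s/√n) from samples of size n follows a
t-distribution with n - 1 degrees of freedom.
As an estimate for the standard error we can use
$$\frac{s}{\sqrt{n}}$$
where s is the sample standard deviation.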
22. T-distribution
The t-distribution is quite similar to the normal distribution, but its exact shape
depends on the degrees of freedom, and hence on the sample size
As the sample size increases, the t-distribution approaches the normal
distribution
Critical values of the t-distribution can be calculated with Excel:
=TINV(probability, degrees of freedom)
For the error margin of a mean, the degrees of freedom equal n - 1
(n = sample size)
Ex. Critical value for the 95% confidence level when the sample size is 50:
=TINV(0.05, 49) returns 2.00957
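The same two-tailed critical value can be approximated outside Excel. The sketch below uses only the Python standard library. It integrates the t density numerically with Simpson's rule and then finds the critical value by bisection. The function names `t_pdf`, `t_cdf` and `t_critical_two_tailed` are made up for this example, and the numerical method is a stand-in chosen here, not what Excel does internally.

```python
import math

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))

def t_cdf(t, df, steps=2000):
    """P(T <= t) for t >= 0, using Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical_two_tailed(alpha, df):
    """Value c with P(|T| > c) = alpha, found by bisection."""
    low, high = 0.0, 100.0
    target = 1 - alpha / 2
    for _ in range(60):
        mid = (low + high) / 2
        if t_cdf(mid, df) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2

critical_value = t_critical_two_tailed(0.05, 49)
print(f"{critical_value:.5f}")
```

The printed value should land very close to the Excel result quoted above. With a statistics package installed, a library quantile function would normally be used instead of this hand-rolled integration.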
23. Sampling Distribution of Proportion
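The proportion calculated from a sample is usually the best guess for the
population proportion. But different samples give different sample
proportions!
It can be shown that proportions from samples of size n are approximately normally
distributed when n is large:
$$p \sim N\left(\pi, \sqrt{\frac{\pi(1-\pi)}{n}}\right)$$
where p is the sample proportion and π is the population proportion.
The standard error (standard deviation of sample proportions) is
$$\sqrt{\frac{\pi(1-\pi)}{n}}$$
As an estimate for the standard error we use
$$\sqrt{\frac{p(1-p)}{n}}$$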
24. Error margin for proportion
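Based on the sampling distribution of the proportion, we can be
95% sure that the population proportion π lies in the range (95% confidence
interval)
$$p - 1.96\sqrt{\frac{p(1-p)}{n}} \le \pi \le p + 1.96\sqrt{\frac{p(1-p)}{n}}$$
where p is the sample proportion and n is the sample size.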
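The error margin for a proportion is simple enough to compute directly. The following sketch applies the 95% formula to a hypothetical survey in which 240 of 600 respondents answer "yes". The figures and the function name `proportion_confidence_interval` are assumptions for illustration.

```python
import math

def proportion_confidence_interval(p, n, z=1.96):
    """Return (lower, upper) for p ± z * sqrt(p(1 - p) / n)."""
    standard_error = math.sqrt(p * (1 - p) / n)
    margin = z * standard_error
    return p - margin, p + margin

# Hypothetical survey: 240 of 600 respondents answer "yes".
n = 600
p = 240 / n
lower, upper = proportion_confidence_interval(p, n)
print(f"95% CI for the population proportion: {lower:.3f} to {upper:.3f}")
```

Here the estimated standard error is 0.02, so the margin is 1.96 × 0.02 = 0.0392 on either side of p = 0.40.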
27. Degrees of Freedom
In statistics, the number of degrees of freedom is the number of values in the final
calculation of a statistic that are free to vary.
In mechanics, the number of independent ways in which a dynamic system can move, without
violating any constraint imposed on it, is called its number of degrees of freedom. In
other words, it is the minimum
number of independent coordinates that can specify the position of the system
completely.
For a statistic computed from n observations subject to one constraint (such as a fixed sample mean):
df = n - 1
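A short sketch may help show why one degree of freedom is lost. Once the sample mean is fixed, the deviations from it must sum to zero, so the last deviation is determined by the other n - 1. The data values below are hypothetical.

```python
import statistics

# Illustrative data set (hypothetical values).
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = statistics.mean(data)

# Deviations from the sample mean always sum to zero, so once n - 1 of
# them are known the last one is fixed: only n - 1 are free to vary.
deviations = [x - mean for x in data]
last_from_others = -sum(deviations[:-1])

# The sample variance therefore divides by df = n - 1 rather than n.
df = n - 1
sample_variance = sum(d * d for d in deviations) / df

print(f"sum of deviations:           {sum(deviations)}")
print(f"last deviation (determined): {last_from_others} == {deviations[-1]}")
print(f"sample variance with df={df}: {sample_variance:.3f}")
```

The result matches `statistics.variance(data)`, which also uses the n - 1 divisor.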
28. Standard Error
The standard deviation of the sampling distribution of a statistic is known as its
standard error (S.E.) and is considered the key to sampling theory.
The concept of standard error is useful in statistical induction for
the following reasons:
The standard error helps in testing whether the difference between observed and
expected frequencies could arise due to chance.
The standard error gives an idea about the reliability and precision of a sample. The
smaller the S.E., the greater the uniformity of the sampling distribution and hence the greater
the reliability of the sample.
The standard error enables us to specify the limits within which the parameters of the
population are expected to lie with a specified degree of confidence. Such an interval is
usually known as a confidence interval.
30. Central Limit Theorem
When sampling is from a normal population, the means of samples drawn from
such a population are themselves normally distributed. But when sampling is not
from a normal population, the size of the sample plays a critical role. When n is
small, the shape of the distribution will depend largely on the shape of the parent
population, but as n gets large (n > 30), the shape of the sampling distribution will
become more and more like a normal distribution, irrespective of the shape of the
parent population.
The theorem which explains this sort of relationship between the shape of the
population distribution and the sampling distribution of the mean is known as the
central limit theorem.
“The significance of the central limit theorem lies in the fact that it permits us to
use sample statistics to make inferences about population parameters without
knowing anything about the shape of the frequency distribution of that population
other than what we can get from the sample.”
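The effect described by the central limit theorem can be seen in a simulation. The sketch below samples from a strongly skewed exponential population and measures how skewed the distribution of sample means is for a small and a large n. The choice of population, the sample sizes 2 and 40, and the helper names are assumptions for illustration.

```python
import random
import statistics

random.seed(7)

def skewness(values):
    """Sample skewness: 0 for a symmetric (e.g. normal) distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)

def sample_means(n, repetitions=5000):
    """Means of `repetitions` samples of size n from an exponential population."""
    return [
        statistics.fmean(random.expovariate(1.0) for _ in range(n))
        for _ in range(repetitions)
    ]

# The exponential parent population is strongly right-skewed (skewness 2).
skew_small_n = skewness(sample_means(2))
skew_large_n = skewness(sample_means(40))

print(f"skewness of sample means, n = 2:  {skew_small_n:.2f}")
print(f"skewness of sample means, n = 40: {skew_large_n:.2f}")
```

The skewness for n = 40 should be much closer to zero, reflecting a sampling distribution that is already near normal even though the parent population is not.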
31. Finite Population Correction
The Finite Population Correction Factor (FPC) is used when you sample without
replacement from more than 5% of a finite population.
It is needed because under these circumstances the observations are not independent, so the
Central Limit Theorem does not strictly hold and the uncorrected standard error of the
estimate (e.g. the mean or proportion) will be too big.
In basic terms, the FPC captures the difference between sampling with replacement
and sampling without replacement.
$$\mathrm{FPC} = \sqrt{\frac{N - n}{N - 1}}$$
where N is the population size and n is the sample size.
32. CONT…
The following table of values shows how the FPC decreases for a population of 10,000
as the sample size gets larger:
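As a rough sketch, the trend can be recomputed directly from the formula FPC = √((N - n)/(N - 1)). The sample sizes used below are illustrative choices and are not taken from the original table. The function name `finite_population_correction` is likewise an assumption.

```python
import math

def finite_population_correction(population_size, sample_size):
    """FPC = sqrt((N - n) / (N - 1)) for sampling without replacement."""
    return math.sqrt((population_size - sample_size) / (population_size - 1))

N = 10_000
# Illustrative sample sizes; these are assumptions, not the original table's rows.
sample_sizes = [10, 100, 500, 1000, 5000]
fpc_values = [finite_population_correction(N, n) for n in sample_sizes]

print(f"{'n':>6}  {'FPC':>6}")
for n, fpc in zip(sample_sizes, fpc_values):
    print(f"{n:>6}  {fpc:>6.3f}")
```

Multiplying the usual standard error by the FPC gives the corrected standard error, so the correction matters most when the sample is a large share of the population.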
34. Inference
There are two ways to make inferences:
Estimation of parameters
* Point estimation (x̄ or p)
* Interval estimation
Hypothesis testing
35. Estimation of parameters
Population: the mean, μ, is unknown.
Sample: mean x̄ = 50.
Point estimate: x̄ = 50.
Interval estimate: "I am 95% confident that μ is between 40 and 60."
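39. Confidence Interval
[Figure: sampling distribution of the sample mean x̄ on the Z-axis. 95% of samples fall between x̄ - 1.96 SE and x̄ + 1.96 SE; the central area is 1 - α and each tail has area α/2.]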
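40. Confidence Interval
[Figure: sampling distribution of the sample proportion p on the Z-axis. 95% of samples fall between p - 1.96 SE and p + 1.96 SE; the central area is 1 - α and each tail has area α/2.]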
42. Example (Sample size ≥ 30)
An epidemiologist studied the blood glucose level of a random sample of
100 patients. The mean was 170, with an SD of 10.
SE = SD/√n = 10/√100 = 10/10 = 1
Then the 95% CI:
= x̄ ± Z × SE
= 170 ± 1.96 × 1, i.e. from 168.04 to 171.96
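The arithmetic of this example can be reproduced in a few lines. The function name `mean_confidence_interval` is an assumption for this sketch. The inputs (mean 170, SD 10, n = 100, Z = 1.96) come from the example itself.

```python
import math

def mean_confidence_interval(sample_mean, sd, n, z=1.96):
    """Return (lower, upper) for sample_mean ± z * sd / sqrt(n)."""
    standard_error = sd / math.sqrt(n)
    return sample_mean - z * standard_error, sample_mean + z * standard_error

lower, upper = mean_confidence_interval(sample_mean=170, sd=10, n=100)
print(f"95% CI: {lower:.2f} to {upper:.2f}")  # 95% CI: 168.04 to 171.96
```

Using Z = 1.96 rather than a t critical value is reasonable here because the sample size is well above 30.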
43. Hypothesis testing
A statistical method that uses sample data to evaluate a
hypothesis about a population parameter. It is intended to
help researchers differentiate between real and random
patterns in the data.
44. What is a Hypothesis?
An assumption about the population parameter.
Example: "I assume the mean SBP (systolic blood pressure) of participants is 120 mmHg."
45. Null & Alternative Hypotheses
H0, the null hypothesis, states the assumption to be tested, e.g. mean SBP of
participants = 120 (H0: μ = 120).
H1, the alternative hypothesis, is the opposite of the null hypothesis, e.g. mean SBP of
participants ≠ 120 (H1: μ ≠ 120). It may or may not be accepted, and it is usually
the hypothesis that the researcher believes to be true.
46. Level of Significance, α
Defines the unlikely values of the sample statistic if the null hypothesis is
true. These values form the rejection region of the sampling distribution.
Typical values are 0.01 and 0.05.
Selected by the researcher at the start.
Provides the critical value(s) of the test.
48. Result Possibilities
H0: Innocent
Jury Trial
Verdict   | Actual situation: Innocent | Actual situation: Guilty
Innocent  | Correct                    | Error
Guilty    | Error                      | Correct
Hypothesis Test
Decision  | Actual situation: H0 True        | Actual situation: H0 False
Accept H0 | Correct (1 - α)                  | Type II Error (β), False Negative
Reject H0 | Type I Error (α), False Positive | Correct, Power (1 - β)
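The Type I error cell of the table can be made concrete with a simulation in which H0 is true by construction. Every rejection is then a false positive, and the rejection rate should be close to α. The population values (mean 120, σ = 15), n = 100 and the number of trials are assumptions chosen for this sketch.

```python
import math
import random

random.seed(3)

# Assumed setup: H0 is TRUE (population mean really is 120), sigma known.
mu_0, sigma, n = 120.0, 15.0, 100
z_critical = 1.96          # two-tailed critical value for alpha = 0.05
trials = 4000

rejections = 0
for _ in range(trials):
    sample_mean = sum(random.gauss(mu_0, sigma) for _ in range(n)) / n
    z = (sample_mean - mu_0) / (sigma / math.sqrt(n))
    if abs(z) > z_critical:
        rejections += 1    # rejecting a true H0 is a Type I error

type_i_error_rate = rejections / trials
print(f"observed Type I error rate: {type_i_error_rate:.3f} (alpha = 0.05)")
```

Repeating the experiment with a true mean different from `mu_0` would instead estimate the power, 1 - β.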
49. Hypothesis Testing: Steps
Test the assumption that the true mean SBP of participants is 120 mmHg.
State H0: H0: μ = 120
State H1: H1: μ ≠ 120
Choose α: α = 0.05
Choose n: n = 100
Choose the test: Z, t or χ² test
50. Hypothesis Testing: Steps
Compute Test Statistic
Search for Critical Value
Make Statistical Decision rule
Express Decision
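The steps above can be sketched end to end for the SBP example. The hypotheses, α and n follow the slides. The sample mean of 123 and the known population SD of 15 are hypothetical values assumed purely so that the calculation can run, and a Z test is used because σ is treated as known.

```python
import math
from statistics import NormalDist

# Hypotheses, significance level and sample size from the slides.
mu_0 = 120.0      # H0: mu = 120, H1: mu != 120
alpha = 0.05
n = 100

# Hypothetical sample results, assumed purely for illustration.
sample_mean = 123.0
sigma = 15.0      # population SD assumed known, so a Z test applies

# Compute the test statistic.
z = (sample_mean - mu_0) / (sigma / math.sqrt(n))

# Find the two-tailed critical value.
z_critical = NormalDist().inv_cdf(1 - alpha / 2)

# Apply the decision rule and express the decision.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
reject_h0 = abs(z) > z_critical

print(f"z = {z:.2f}, critical value = {z_critical:.2f}, p = {p_value:.4f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

Comparing |z| with the critical value and comparing the p-value with α are equivalent decision rules, so they always give the same decision.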
51. One sample-mean Test
Assumptions
Population is normally distributed
t test statistic
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{\text{sample mean} - \text{null value}}{\text{standard error}}$$
52. Example Normal Body Temperature
What is normal body temperature? Is it actually 37.6 °C (on average)?
State the null and alternative hypotheses
H0: μ = 37.6 °C
Ha: μ ≠ 37.6 °C
53. Example Normal Body Temp (cont)
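Summarize data with a test statistic
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{\text{sample mean} - \text{null value}}{\text{standard error}}$$
Data: random sample of n = 18 normal body temps (°C)
37.2 36.8 38.0 37.6 37.2 36.8 37.4 38.7 37.2
36.4 36.6 37.4 37.0 38.2 37.6 36.1 36.2 37.5
Variable    | n  | Mean  | SD   | SE    | t     | P
Temperature | 18 | 37.22 | 0.68 | 0.161 | -2.38 | 0.029
The sample mean is below the null value of 37.6, so t is negative; its magnitude is |t| = 2.38.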
55. Example Normal Body Temp (cont)
Find the p-value
df = n - 1 = 18 - 1 = 17
From SPSS: p-value = 0.029
From the t table: the p-value is between
0.05 and 0.01.
The value |t| = 2.38 lies between the
column headings 2.110 and 2.898
in the table, which for df = 17 correspond to two-tailed
p-values of 0.05 and 0.01.
By symmetry, the area to the left of t = -2.11 equals
the area to the right of t = +2.11.
[Figure: t-distribution with t = -2.11 and t = +2.11 marked on the t-axis.]
56. Example Normal Body Temp (cont)
Decide whether or not the result is statistically significant based on the p-value
Using α = 0.05 as the level of significance criterion, the results are
statistically significant because 0.029 is less than 0.05. In other words, we
can reject the null hypothesis.
Report the Conclusion
We can conclude, based on these data, that the mean temperature in the
human population does not equal 37.6 °C.
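The summary statistics of this example can be recomputed from the 18 listed temperatures using only the Python standard library. The critical value 2.110 is the df = 17 table value quoted in the example. The variable names are assumptions made for this sketch.

```python
import math
import statistics

# The 18 body temperatures (°C) listed in the example.
temperatures = [
    37.2, 36.8, 38.0, 37.6, 37.2, 36.8, 37.4, 38.7, 37.2,
    36.4, 36.6, 37.4, 37.0, 38.2, 37.6, 36.1, 36.2, 37.5,
]
null_value = 37.6          # H0: mu = 37.6 °C
t_critical = 2.110         # two-tailed critical value for alpha = 0.05, df = 17

n = len(temperatures)
sample_mean = statistics.mean(temperatures)
sample_sd = statistics.stdev(temperatures)      # divides by n - 1
standard_error = sample_sd / math.sqrt(n)
t_statistic = (sample_mean - null_value) / standard_error

reject_h0 = abs(t_statistic) > t_critical

print(f"mean = {sample_mean:.2f}, SD = {sample_sd:.2f}, SE = {standard_error:.3f}")
print(f"t = {t_statistic:.2f}, df = {n - 1}, reject H0: {reject_h0}")
```

The computed t is negative because the sample mean lies below the null value. Its magnitude matches the 2.38 reported in the table, and it exceeds the critical value, which agrees with the decision to reject H0.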
57. Case Study: STATISTICAL INFERENCE OF A CASE
STUDY IN CHINA: ACTIVE PHOSPHATE REMOVAL
FROM EUTROPHIC WATER
China exports a huge amount of duck meat. Recently, more and
more people have been raising ducks in ponds together with fish. Previous research has shown
that the yield of fish in a duck-fish integrated system pond is greater than the yield
in non-integrated system ponds.
At the same time, the duck-fish system reduced the pollution significantly.
However, polluted water still remains because of the phosphorus and
nitrate entering it from the ducks (Adel K. Soliman, 2000).
58. Experimental Design and Sample Collection
The experiments were performed in Anhui, China. Three ponds, A, B and C, were
selected.
Pond A is our treatment pond, where we planted the water spinach. It had ducks
and fish. Pond B is a pond with ducks and fish but without water spinach. Pond C is
the control pond, with fish only. We built a floating bed of size 5 m × 1.2 m to hold the
water spinach in pond A.
59. Sample Collection
Samples were obtained at each of these locations in the three ponds:
A1: concentration from water within water spinach area in pond A;
A2: concentration from water outside water spinach area in pond A;
B1: concentration from water under duck sheds in pond B;
B2: concentration from water away from duck sheds in pond B;
C1: concentration from water in pond C without duck or water spinach.
60. Data Analysis Result
When we plot the measurements from the same pond, we get Figure 3. The
observations for both ammonia-nitrogen and active phosphate from A1
decrease continuously. The observations for ammonia-nitrogen from C1 do not
show a decreasing trend.
61. Conclusion
We performed multiple paired t-tests to compare the mean concentrations of
ammonia-nitrogen at various locations. The p-value between samples from A1 and
A2 is greater than 0.1, so there is no evidence of a real difference in the concentration of
ammonia-nitrogen at the two locations within pond A.
There is a real difference in the concentration of active phosphate at the two
locations within pond B. The significance test indicates that the water near the
ducks is more polluted by active phosphate than the water elsewhere.
The p-value between samples from A1 and B2 is close to 0.0005, so there is
significant evidence that planting water spinach reduces the active phosphate
content in the water.
62. References
C. R. Kothari, Research Methodology: Methods and Techniques, 2nd revised edition,
New Age International Publishers.
June Luo and Ling Zu, Statistical Inference of a Case Study in China: Active
Phosphate Removal from Eutrophic Water, Department of Applied
Economics and Statistics, Clemson University.
https://www.slideshare.net/rambhu21/sampling-and-sampling-errors-19870549/62