Sampling and statistical inference

Topic: Chapter 9 - Sampling and Statistical Inference
SUBJECT - Research Methodology in Civil Engineering - CE541
FACULTY GUIDE - Prof. Amit A. Amin
PREPARED BY - Bhavik A. Shah (17TS809)
CIVIL ENGG. DEPARTMENT
BIRLA VISHVAKARMA MAHAVIDYALAYA ENGG. COLLEGE
VALLABH VIDYANAGAR-388120
M.TECH - TRANSPORTATION ENGINEERING

Table of Contents
 Introduction
 Parameter and Statistic
 Sampling and Non-Sampling Errors
 Sampling Distribution
 Degree of Freedom
 Standard Error
 Central Limit Theorem
 Finite Population Correction
 Statistical Inference

Introduction
 A population is the collection of all the elements of interest.
 A sample is a subset of the population.
 Sampling may be defined as the selection of some part of an aggregate or
totality on the basis of which a judgement or inference about the
aggregate or totality is made. In other words, it is the process of obtaining
information about an entire population by examining only a part of it.

Why sample?
 To save the time of the researcher and of those being surveyed.
 To reduce the cost to the group or agency commissioning the survey.
 To protect confidentiality and anonymity, and to respect other ethical issues.
 To avoid interfering with the population. A large sample could alter the nature of
the population, e.g. in opinion surveys.
 To avoid destroying the population, e.g. crash testing only a small sample of automobiles.
 To secure the cooperation of respondents: individuals, firms, administrative agencies.
 Because partial data is all that is available, e.g. fossils and historical records, climate change.

NEED FOR SAMPLING
 Sampling can save time and money. A sample study is usually less expensive than a
census study and produces results faster.
 Sampling may enable more accurate measurements, because a sample study is generally
conducted by trained and experienced investigators.
 Sampling remains the only way when the population contains infinitely many members.
 Sampling remains the only choice when a test involves the destruction of the item
under study.
 Sampling usually enables us to estimate the sampling errors and thus assists in
obtaining information concerning some characteristic of the population.

Parameter and Statistic
 A statistic is a characteristic of a sample, whereas a parameter is a characteristic of
a population. Thus, measures such as the mean, median or mode are called statistics
when they are worked out from a sample, because they describe the characteristics of
that sample. When such measures describe the characteristics of a population, they
are known as parameters.
 For instance, the population mean (μ) is a parameter, whereas the sample mean is a
statistic. Obtaining an estimate of a parameter from a statistic is the prime
objective of sampling analysis.
Parameter = Statistic ± Its Error

Parameter and Statistic
Measure               Parameter (from entire population)   Statistic (from sample)
Mean                  μ                                     X̄
Standard deviation    σ                                     s
Proportion            π                                     p
Each sample statistic estimates the corresponding population parameter.

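Sampling and Non-Sampling Errors
 Sampling error refers to differences between the sample and the population that
exist only because of the observations that happened to be selected for the sample.
Increasing the sample size will reduce this type of error.
 Non-sampling error arises from sources other than the chance selection of
observations (for example non-response and response errors) and is not reduced by
increasing the sample size.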

Types of Error
 Sampling errors
 Non-sampling errors

Sampling Errors
 Errors caused by the act of taking a sample
 They cause sample results to differ from the results of a census
 They are differences between the sample and the population that exist only
because of the observations that happened to be selected for the sample
 Statistical errors are sampling errors
 We have no control over them, because they arise from the chance selection of
observations

Non-Sampling Errors
 Not controlled by sample size
 Non-response error
 Response error

Non Response Error
 A non-response error occurs when units selected as part of
the sampling procedure do not respond in whole or in part.

Response Errors
 A response or data error is any systematic bias that occurs during data
collection, analysis or interpretation. Sources include:
 * Respondent error (e.g., lying, forgetting, etc.)
 * Interviewer bias
 * Recording errors
 * Poorly designed questionnaires
 * Measurement error

Respondent error
 respondent gives an incorrect answer, e.g. due to prestige or competence
implications, or due to sensitivity or social undesirability of question
 respondent misunderstands the requirements
 lack of motivation to give an accurate answer
 “lazy” respondent gives an “average” answer
 question requires memory/recall
 proxy respondents are used, i.e. taking answers from someone other than
the respondent

Interviewer bias
 Different interviewers administer a survey in different ways
 Differences occur in reactions of respondents to different interviewers, e.g.
to interviewers of their own sex or own ethnic group
 Inadequate training of interviewers
 Inadequate attention to the selection of interviewers
 There is too high a workload for the interviewer

Measurement Error
 The question is unclear, ambiguous or difficult to answer
 The list of possible answers suggested in the recording instrument is
incomplete
 The requested information assumes a framework unfamiliar to the respondent
 The definitions used by the survey are different from those used by the
respondent (e.g. "How many part-time employees do you have?", where the survey
and the respondent may understand "part-time" differently)

Key Points on Errors
 Non-sampling errors are inevitable in the production of national statistics.
It is important that:
 At the planning stage, all potential non-sampling errors are listed and steps
to minimise them are considered.
 If data are collected from other sources, the procedures adopted for data
collection and for data verification at each step of the data chain are questioned.
 The data collected are viewed critically and queries are resolved as soon as
they arise.
 Sources of non-sampling errors are documented so that the results presented can be
interpreted meaningfully.

Sampling Distributions
 Sampling Distribution of Mean
 Student’s ‘t’ Distribution
 Sampling Distribution of Proportion
 F Distribution
 Chi-square Distribution

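Sampling distribution of mean
 The mean calculated from a sample is usually the best guess for the population mean. But
different samples give different sample means ($\bar{x}_1, \bar{x}_2, \bar{x}_3, \ldots$)!
 It can be shown that sample means from samples of size n are normally distributed
(exactly when the population is normal, approximately when n is large):
$\bar{x} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$
 The term $\sigma/\sqrt{n}$ is called the standard error (the standard deviation of the sample means).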

CONT…
The sample mean comes from the normal distribution $N(\mu, \sigma/\sqrt{n})$.
Knowing the properties of the normal distribution, we can be 95% sure that the sample mean is in
the range:
$\mu - 1.96\,\frac{\sigma}{\sqrt{n}} \le \bar{x} \le \mu + 1.96\,\frac{\sigma}{\sqrt{n}}$
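As a minimal sketch of the 95% range for the sample mean, the snippet below computes the standard error σ/√n and the limits μ ± z·σ/√n. The values of mu, sigma and n are illustrative assumptions rather than data from this presentation, and `sample_mean_range` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def sample_mean_range(mu, sigma, n, confidence=0.95):
    """Return (standard error, lower limit, upper limit) for the sample mean."""
    se = sigma / math.sqrt(n)                        # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    return se, mu - z * se, mu + z * se

# Assumed population values, for illustration only
se, low, high = sample_mean_range(mu=50.0, sigma=12.0, n=36)
print(f"SE = {se:.2f}; about 95% of sample means fall in [{low:.2f}, {high:.2f}]")
```

Quadrupling n in this sketch halves the standard error, which is why larger samples give narrower ranges.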

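CONT…
 If the population standard deviation is unknown, it is estimated by the sample
standard deviation s. It can then be shown that, for samples of size n from a normal
population, the standardized sample mean $(\bar{x} - \mu)/(s/\sqrt{n})$ is t-distributed
with n - 1 degrees of freedom
 As an estimate for the standard error we can use
$\frac{s}{\sqrt{n}}$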

T-distribution
 The t-distribution is quite similar to the normal distribution, but its exact shape
depends on the sample size (through the degrees of freedom)
 As the sample size increases, the t-distribution approaches the normal
distribution
 Critical values of the t-distribution can be calculated with Excel:
=TINV(probability, degrees of freedom)
where probability is the two-tailed significance level (the argument separator is ";"
in some regional settings)
 For the error margin of a mean, the degrees of freedom equal n - 1
(n = sample size)
 Ex. Critical value for the 95% confidence level when the sample size is 50:
=TINV(0.05, 49) gives approximately 2.0096
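For readers without Excel, the two-tailed critical value can be approximated with the Python standard library alone. The sketch below integrates the t density numerically (Simpson's rule) and finds the critical value by bisection. The function names `t_pdf`, `t_cdf` and `t_critical`, the step count and the search bound are assumptions made for this illustration, not a library API.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=4000):
    """P(T <= x): Simpson's rule on [0, |x|] plus symmetry about zero."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_critical(alpha, df, upper=100.0):
    """Two-tailed critical value t with P(|T| > t) = alpha, found by bisection."""
    lo, hi = 0.0, upper
    for _ in range(60):
        mid = (lo + hi) / 2
        if 2 * (1 - t_cdf(mid, df)) > alpha:
            lo = mid      # tail area still too large: move right
        else:
            hi = mid
    return (lo + hi) / 2

print(round(t_critical(0.05, 49), 4))  # same quantity as =TINV(0.05, 49)
```

The numerical approach is chosen only to keep the example dependency-free; a statistics package would normally be used instead.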

Sampling Distribution of Proportion
 The proportion calculated from a sample is usually the best guess for the
population proportion. But different samples give different sample
proportions!
 It can be shown that, for large n, proportions from samples of size n are
approximately normally distributed:
$p \sim N\left(\pi, \sqrt{\frac{\pi(1-\pi)}{n}}\right)$
 The standard error (standard deviation of the sample proportions) is
$\sqrt{\frac{\pi(1-\pi)}{n}}$
 As an estimate for the standard error we use
$\sqrt{\frac{p(1-p)}{n}}$
Here π denotes the population proportion and p the sample proportion.

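Error margin for proportion
 Based on the sampling distribution of the proportion, we can be
95% sure that the population proportion π lies in the interval (95% confidence
interval)
$p - 1.96\sqrt{\frac{p(1-p)}{n}} \le \pi \le p + 1.96\sqrt{\frac{p(1-p)}{n}}$
where p is the sample proportion and n is the sample size.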
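The interval for a proportion can be sketched in a few lines of Python. The counts used here (120 "yes" answers out of 400) are an assumed example, not data from this presentation, and `proportion_ci` is a hypothetical helper name.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Return (p, lower, upper): sample proportion and its approximate 95% CI."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)   # estimated standard error of p
    return p, p - z * se, p + z * se

# Assumed survey result, for illustration only
p, low, high = proportion_ci(successes=120, n=400)
print(f"p = {p:.3f}, 95% CI = [{low:.4f}, {high:.4f}]")
```

The normal approximation behind this interval is reasonable only for fairly large samples with p not too close to 0 or 1.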

Degree of Freedom
 In statistics, the number of degrees of freedom is the number of values in the final
calculation of a statistic that are free to vary.
 By analogy with mechanics: the number of independent ways in which a dynamic system
can move, without violating any constraint imposed on it, is called its number of
degrees of freedom. In other words, it is the minimum number of independent
coordinates that specify the position of the system completely.
 For a sample of size n whose mean has been estimated (as in the sample standard
deviation and the one-sample t statistic): df = n - 1
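A small sketch of why df = n - 1: once the sample mean is fixed, the deviations from it must sum to zero, so only n - 1 of them are free to vary. The five sample values below are assumed for illustration.

```python
from statistics import mean, variance

sample = [4.0, 7.0, 6.0, 3.0, 5.0]   # assumed values, illustration only
n = len(sample)
x_bar = mean(sample)
deviations = [x - x_bar for x in sample]

# The last deviation is fully determined by the first n - 1 deviations.
last_from_others = -sum(deviations[:-1])

# The sample variance therefore divides by n - 1, not n.
s2 = sum(d * d for d in deviations) / (n - 1)

print(last_from_others, deviations[-1])   # the two values agree
print(s2, variance(sample))               # statistics.variance also uses n - 1
```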

Standard Error
 The standard deviation of the sampling distribution of a statistic is known as its
standard error (S.E.) and is considered the key to sampling theory.
 The utility of the concept of standard error in statistical induction arises for
the following reasons:
 The standard error helps in testing whether the difference between observed and
expected frequencies could arise due to chance.
 The standard error gives an idea of the reliability and precision of a sample. The
smaller the S.E., the greater the uniformity of the sampling distribution and hence the
greater the reliability of the sample.
 The standard error enables us to specify the limits within which the parameters of the
population are expected to lie with a specified degree of confidence. Such an interval is
usually known as a confidence interval.

Central Limit Theorem
 When sampling is from a normal population, the means of samples drawn from
such a population are themselves normally distributed. But when sampling is not
from a normal population, the size of the sample plays a critical role. When n is
small, the shape of the distribution will depend largely on the shape of the parent
population, but as n gets large (n > 30), the shape of the sampling distribution will
become more and more like a normal distribution, irrespective of the shape of the
parent population.
 The theorem which explains this sort of relationship between the shape of the
population distribution and the sampling distribution of the mean is known as the
central limit theorem.
 “The significance of the central limit theorem lies in the fact that it permits us to
use sample statistics to make inferences about population parameters without
knowing anything about the shape of the frequency distribution of that population
other than what we can get from the sample.”
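The central limit theorem can be illustrated with a short simulation. The sketch below draws repeated samples from a strongly skewed exponential population (mean 1, standard deviation 1) and checks that the sample means behave like a normal distribution with standard error σ/√n. The population, the sample size n = 40, the number of replicates and the seed are all assumptions chosen for illustration.

```python
import math
import random
from statistics import mean, stdev

def simulate_sample_means(n, replicates, seed=42):
    """Means of `replicates` samples of size n from an exponential(1) population."""
    rng = random.Random(seed)
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n
            for _ in range(replicates)]

n = 40
means = simulate_sample_means(n, replicates=2000)
expected_se = 1.0 / math.sqrt(n)   # sigma / sqrt(n) with sigma = 1

# Share of sample means within 1.96 standard errors of the population mean
inside = sum(abs(m - 1.0) <= 1.96 * expected_se for m in means) / len(means)

print(f"mean of sample means = {mean(means):.3f} (population mean 1.0)")
print(f"sd of sample means   = {stdev(means):.3f} (sigma/sqrt(n) = {expected_se:.3f})")
print(f"share within 1.96 SE = {inside:.3f} (normal theory: about 0.95)")
```

Rerunning the sketch with a small n (say 3) shows the skewness of the parent population reappearing in the sample means.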

Finite Population Correction
 The Finite Population Correction Factor (FPC) is used when you sample without
replacement from more than 5% of a finite population.
 It is needed because under these circumstances the sampled observations are not
independent, so the independence assumption behind the Central Limit Theorem does not
hold and the usual standard error of the estimate (e.g. the mean or proportion) will be
too big.
 In basic terms, the FPC captures the difference between sampling with replacement
and sampling without replacement.
 FPC = $\sqrt{\frac{N-n}{N-1}}$, where N is the population size and n is the sample size.
The usual standard error is multiplied by the FPC.

CONT…
The following table of values shows how the FPC decreases for a population of 10,000
as the sample size gets larger:
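The FPC values for a population of 10,000 can be recomputed directly from FPC = sqrt((N - n)/(N - 1)). In the sketch below the sample sizes are illustrative choices and `fpc` is a hypothetical helper name.

```python
import math

def fpc(N, n):
    """Finite population correction for a sample of n drawn without replacement from N."""
    return math.sqrt((N - n) / (N - 1))

N = 10_000
print(" sample size   n/N     FPC")
for n in (10, 100, 500, 1000, 5000):   # assumed sample sizes
    print(f"{n:>12}  {n / N:>5.1%}  {fpc(N, n):.4f}")
```

For samples below about 5% of the population the factor stays close to 1, which is why the correction is usually ignored in that case.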

Statistical Inference
Use a random sample to learn something about a larger population.

Inference
 Two ways to make inference
 Estimation of parameters
 * Point estimation (X̄ or p)
 * Interval estimation
 Hypothesis testing

Estimation of parameters
 Population: the mean, μ, is unknown.
 Sample: mean X̄ = 50.
 Point estimate: X̄ = 50 is the single best guess for μ.
 Interval estimate: "I am 95% confident that μ is between 40 & 60."

Parameter = Statistic ± Its Error

Sampling Distribution
[Figure: sampling distribution of the sample statistic (X̄ or p) over repeated samples]

Standard Error
 Quantitative variable: SE (Mean) = $\frac{s}{\sqrt{n}}$
 Qualitative variable: SE (p) = $\sqrt{\frac{p(1-p)}{n}}$

Confidence Interval
[Figure: normal curve on the Z-axis with central area 1 - α (95% of samples) between
X̄ - 1.96 SE and X̄ + 1.96 SE, and area α/2 in each tail]

Confidence Interval
[Figure: normal curve on the Z-axis with central area 1 - α (95% of samples) between
p - 1.96 SE and p + 1.96 SE, and area α/2 in each tail]

Example (Sample size ≥ 30)
 An epidemiologist studied the blood glucose level of a random sample of
100 patients. The mean was 170, with a SD of 10.
 SE = 10/√100 = 10/10 = 1
 Then the 95% CI is μ = X̄ ± Z × SE:
 μ = 170 ± 1.96 × 1, so 168.04 ≤ μ ≤ 171.96
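The blood glucose example can be checked with a short Python sketch. The inputs (mean 170, SD 10, n = 100, Z = 1.96) are taken from the example; `mean_ci` is a hypothetical helper name.

```python
import math

def mean_ci(x_bar, sd, n, z=1.96):
    """Return (SE, lower, upper) for a large-sample confidence interval of the mean."""
    se = sd / math.sqrt(n)
    return se, x_bar - z * se, x_bar + z * se

se, low, high = mean_ci(x_bar=170, sd=10, n=100)
print(f"SE = {se:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```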

Hypothesis testing
 A statistical method that uses sample data to evaluate a
hypothesis about a population parameter. It is intended to
help researchers differentiate between real and random
patterns in the data.

What is a Hypothesis?
 An assumption about the population parameter.
 Example: "I assume the mean SBP (systolic blood pressure) of participants is 120 mmHg."

Null & Alternative Hypotheses
 H0, the null hypothesis, states the assumption to be tested, e.g. mean SBP of
participants = 120 (H0: μ = 120).
 H1, the alternative hypothesis, is the opposite of the null hypothesis, e.g. mean SBP of
participants ≠ 120 (H1: μ ≠ 120). It may or may not be accepted, and it is usually
the hypothesis that the researcher believes to be true.

Level of Significance, α
 Defines the unlikely values of the sample statistic if the null hypothesis is
true. These values form the rejection region of the sampling distribution
 Typical values are 0.01 and 0.05
 Selected by the researcher at the start
 Provides the critical value(s) of the test

Level of Significance and the Rejection Region
[Figure: sampling distribution centred on 0, with rejection regions of total area α in
the tails beyond the critical value(s)]

Result Possibilities

Jury Trial (H0: Innocent)
              Actual Situation
Verdict       Innocent     Guilty
Innocent      Correct      Error
Guilty        Error        Correct

Hypothesis Test
              Actual Situation
Decision      H0 True                             H0 False
Accept H0     Correct (1 - α)                     Type II Error (β), False Negative
Reject H0     Type I Error (α), False Positive    Correct, Power (1 - β)
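The meaning of α as the Type I error rate can be illustrated by simulation: when H0 is actually true, a two-sided test at α = 0.05 should reject in roughly 5% of repeated samples. The sketch below uses a z-test with a known population standard deviation. The values μ0 = 120 and n = 100 echo the SBP example, while σ = 15, the number of trials and the seed are assumptions made for illustration.

```python
import math
import random

def type_i_error_rate(mu0=120.0, sigma=15.0, n=100, z_crit=1.96,
                      trials=4000, seed=1):
    """Share of samples drawn under H0 (true mean = mu0) in which H0 is rejected."""
    rng = random.Random(seed)
    se = sigma / math.sqrt(n)
    rejections = 0
    for _ in range(trials):
        x_bar = sum(rng.gauss(mu0, sigma) for _ in range(n)) / n
        z = (x_bar - mu0) / se
        if abs(z) > z_crit:       # sample mean falls in the rejection region
            rejections += 1       # a false positive, because H0 is true here
    return rejections / trials

rate = type_i_error_rate()
print(f"Observed Type I error rate: {rate:.3f} (nominal alpha = 0.05)")
```

Simulating with a true mean different from μ0 would instead estimate the power, 1 - β.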

Hypothesis Testing: Steps
 Test the assumption that the true mean SBP of participants is 120 mmHg.
 State H0: H0: μ = 120
 State H1: H1: μ ≠ 120
 Choose α: α = 0.05
 Choose n: n = 100
 Choose test: Z, t, or χ² test

Hypothesis Testing: Steps (cont.)
 Compute the test statistic
 Find the critical value
 Make the statistical decision (apply the decision rule)
 Express the decision

One sample-mean Test
 Assumptions
 * Population is normally distributed
 t test statistic
$t = \frac{\text{sample mean} - \text{null value}}{\text{standard error}} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$

Example Normal Body Temperature
 What is normal body temperature? Is it actually 37.6 °C (on average)?
State the null and alternative hypotheses
 H0: μ = 37.6 °C
 Ha: μ ≠ 37.6 °C

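Example Normal Body Temp (cont)
Summarize data with a test statistic
$t = \frac{\text{sample mean} - \text{null value}}{\text{standard error}} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$
Data: random sample of n = 18 normal body temps (°C)
37.2 36.8 38.0 37.6 37.2 36.8 37.4 38.7 37.2
36.4 36.6 37.4 37.0 38.2 37.6 36.1 36.2 37.5
Variable      n    Mean    SD     SE      t       P
Temperature   18   37.22   0.68   0.161   -2.38   0.029
The statistic is negative because the sample mean lies below the null value of 37.6;
its magnitude, |t| = 2.38, is the value compared with the t table.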
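The summary row can be reproduced from the 18 observations with the Python standard library. This is a sketch of the one-sample t statistic only; the critical value 2.110 (df = 17, two-tailed 0.05) is taken from the t table rather than computed.

```python
import math
from statistics import mean, stdev

temps = [37.2, 36.8, 38.0, 37.6, 37.2, 36.8, 37.4, 38.7, 37.2,
         36.4, 36.6, 37.4, 37.0, 38.2, 37.6, 36.1, 36.2, 37.5]
mu0 = 37.6                       # null value from H0

n = len(temps)
x_bar = mean(temps)
s = stdev(temps)                 # sample SD (divides by n - 1)
se = s / math.sqrt(n)            # standard error of the mean
t = (x_bar - mu0) / se           # one-sample t statistic

t_crit = 2.110                   # from the t table: df = 17, two-tailed 0.05
reject_h0 = abs(t) > t_crit

print(f"n = {n}, mean = {x_bar:.2f}, SD = {s:.2f}, SE = {se:.3f}, t = {t:.2f}")
print("Reject H0 at the 0.05 level:", reject_h0)
```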

STUDENT'S t DISTRIBUTION TABLE
Two-tailed critical values of t
Degrees of     Probability (p value)
freedom        0.10      0.05      0.01
1              6.314     12.706    63.657
5              2.015     2.571     4.032
10             1.812     2.228     3.169
17             1.740     2.110     2.898
20             1.725     2.086     2.845
24             1.711     2.064     2.797
25             1.708     2.060     2.787
∞              1.645     1.960     2.576
 Find the p-value
 df = n - 1 = 18 - 1 = 17
 From SPSS: p-value = 0.029
 From the t table: the p-value is between 0.05 and 0.01.
 The table is two-tailed: the area to the left of t = -2.11 equals the
area to the right of t = +2.11, and together they make up 0.05.
 The magnitude of the test statistic, |t| = 2.38, lies between the tabulated
values 2.110 and 2.898 for df = 17, whose p-values are 0.05 and 0.01.
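The p-value reported by SPSS can be approximated without statistical software by integrating the t density numerically. The sketch below uses Simpson's rule from the Python standard library; `t_pdf` and `two_tailed_p` are hypothetical helper names, and the step count is an assumption.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log1p(x * x / df))

def two_tailed_p(t, df, steps=2000):
    """P(|T| >= |t|): one minus the central area, via Simpson's rule on [0, |t|]."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half_central_area = total * h / 3      # area between 0 and |t|
    return 1 - 2 * half_central_area

p_value = two_tailed_p(-2.38, df=17)
print(f"two-tailed p-value for t = -2.38, df = 17: {p_value:.3f}")
```

Because the t distribution is symmetric, the sign of t does not affect the two-tailed p-value.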
 Decide whether or not the result is statistically significant based on the
p-value
 Using α = 0.05 as the level of significance criterion, the results are
statistically significant because 0.029 is less than 0.05. In other words, we
can reject the null hypothesis.
 Report the conclusion
 We can conclude, based on these data, that the mean temperature in the
human population does not equal 37.6 °C.

Case Study: STATISTICAL INFERENCE OF A CASE
STUDY IN CHINA: ACTIVE PHOSPHATE REMOVAL
FROM EUTROPHIC WATER
 China is a country that exports a huge amount of duck meat. Recently, more and
more people raise ducks in ponds together with fish. Previous research has shown
that the yield of fish in a duck-fish integrated system pond is greater than the yield
in non-integrated system ponds.
 At the same time, the duck-fish system reduces pollution significantly.
However, polluted water still remains because of the phosphorus and
nitrate entering the water from the ducks (Adel K. Soliman, 2000).

Experimental Design and Sample Collection
 The experiments were performed in Anhui, China. Three ponds, A, B and C, were
selected.
 Pond A is our treatment pond, where we planted the water spinach. It had ducks
and fish. Pond B is a pond with ducks and fish but without water spinach. Pond C is
the control pond with fish only. We built a floating bed of size 5 m × 1.2 m to fix the
water spinach in pond A.

Sample Collection
 The samples were obtained at each of these locations in the three ponds:
 A1: concentration from water within water spinach area in pond A;
 A2: concentration from water outside water spinach area in pond A;
 B1: concentration from water under duck sheds in pond B;
 B2: concentration from water away from duck sheds in pond B;
 C1: concentration from water in pond C without duck or water spinach.

Data Analysis Result
 When we plot the measurements from the same pond, we get Figure 3. The
observations for both ammonia-nitrogen and active phosphate from A1
decrease continuously. The observations for ammonia-nitrogen from C1 do not
show a decreasing trend.

Conclusion
 We performed multiple paired t-tests to compare the mean concentrations of
ammonia-nitrogen at various locations. The p-value between samples from A1 and
A2 is greater than 0.1, so there is no evidence of a real difference in the
concentration of ammonia-nitrogen between the two locations within pond A.
 There is a real difference in the concentration of active phosphate between the two
locations within pond B. The significance test indicates that the water near the
ducks is more polluted by active phosphate than the water elsewhere.
The p-value between samples from A1 and B2 is close to 0.0005, so there is
significant evidence that planting water spinach reduces the active phosphate
content in the water.

Reference
 C. R. Kothari - Research Methodology: Methods and Techniques, 2nd Revised edition,
New Age International Publishers.
 June Luo, Ling Zu - STATISTICAL INFERENCE OF A CASE STUDY IN CHINA: ACTIVE
PHOSPHATE REMOVAL FROM EUTROPHIC WATER, Department of Applied
Economics and Statistics, Clemson University.
 https://www.slideshare.net/rambhu21/sampling-and-sampling-errors-19870549/62

