Sampling
and
Sampling Distribution
by
Umesh K. Pandey
M.Sc., MBA, MBL, DIPC, Ph.D.
© The Author
Important notice on reuse, reproduction or commercial use:
Complete reproduction without alteration of the content, partial or as a whole, is permitted for non-commercial, personal
and academic purposes without a prior permission provided such reproduction includes full citation of the article, an
acknowledgement of the copyright and link to the article. The author should be informed about this use if more than one
copy is being made or the content, partial or as a whole, is being reproduced on a website, intranet or any other electronic
media.
Contents of this ebook, partial or as a whole, should not be included in a framed web page.
Contents of this site, partial or as a whole, should not be included in a password protected site or a site which requires
registration, even if free.
Contents of this ebook, partial or as a whole, should not be included in a site which charges for other contents but provides
the content from this site for free.
A site or any electronic media reproducing content from this ebook if includes advertisements placed along with our content
or generates any form of revenue due to the contents of this ebook, should share the revenue with the author.
For seeking permission for commercial reuse please contact author at yukaypee@hotmail.com
INDEX
Sampling and Sampling Distribution
1. Sampling
1.1 What is Sampling?
1.2 Why Sampling instead of Census?
1.3 Sampling Methods
1.3.1 Probability Sampling Methods
1.3.1.1 Simple Random Sampling
1.3.1.2 Systematic Sampling
1.3.1.3 Stratified Sampling
1.3.1.4 Cluster Sampling
1.3.2 Non-Probability Sampling Methods
1.3.2.1 Convenience Sampling
1.3.2.2 Purposive Sampling
1.3.2.2.1 Judgment Sampling
1.3.2.2.2 Quota Sampling
2. Sampling Distribution
2.1 Sampling Distribution of the Mean
2.2 The Central Limit Theorem
2.3 Sampling Distribution of the Variance
2.4 The Chi-square Distribution
2.5 Sampling Distribution of the Proportion
2.6 The Confidence Level
3. Bibliography
***
1. Sampling
When managers use research, they are applying the methods of science to the art
of management. Business operates in a world of uncertainty, and no single
method can entirely eliminate this uncertainty. Nevertheless, research
methodology can reduce the extent of uncertainty and the probability
of making a wrong choice among alternative courses of action. The increasingly
complex nature of business and governance therefore focuses more and more attention on the
use of research methodology in solving managerial problems. In the prevailing highly
involved environment, neither a business decision nor a governmental decision can be made
casually or on the basis of intuition.
It is through appropriate data and their analysis that the decision maker becomes equipped
with proper tools of decision making. Needless to say, the credibility of the results derived
from the application of such methodology depends upon the reliability of the data included in
the analysis.
The quantitative tools of inferential statistics are extensively used to address
managerial and business problems with the relevant data. Inferential statistics
use samples to estimate a population parameter and to judge whether an observed result goes
beyond what could plausibly happen by chance. Good research is only as good as the
design, methods, and statistics used. Yet the design, methods, and statistics are useless if
an optimal sample is not used in the first place. Thus sampling is the cornerstone of any
business research.
1.1 What is Sampling?
The term “sampling” indicates the selection of a part of a group or an
aggregate with a view to obtaining information about the whole. This aggregate, or the
totality of all members, is known as the Population, although its members need not be human
beings. The selected part, which is used to ascertain the characteristics of the population, is
called the Sample. While choosing a sample, the population is assumed to be composed of
individual units or members, some of which are included in the sample. The total number of
members of the population and the number included in the sample are called the Population Size
and the Sample Size respectively. The concept can be shown through the following Venn diagram,
where the population is the universal set and the sample is a proper subset.
Figure 1: Research Methodology (diagram labels: Population, Sample, Parameter, Statistic,
Sampling, Estimation, Inference)
Figure 2: Population & Sample (Population: set of all items; Sample: set of chosen items)
Generalising on the basis of information collected from a part is a long-established
practice. With the advancement of management science, more sophisticated
applications of sampling in business and industry are available. Sampling methodology can
be used by an auditor or an accountant to estimate the value of the total inventory in the stores
without physically inspecting all the items. Opinion polls based on samples are used
to forecast the result of a forthcoming election.
1.2 Why Sampling instead of Census?
A census, or complete enumeration, consists of collecting data from each and every
unit of the population. Sampling chooses only a part of the units of the population
for the same study. Sampling has a number of advantages over complete
enumeration, for a variety of reasons.
Cost
The first obvious advantage of sampling is that it is less expensive. If we want to study
consumer reaction before launching a new product, it will be much less expensive to
carry out a consumer survey based on a sample than to study the entire population
of potential customers. Although every individual is enumerated in a decennial census,
certain aspects of the population are studied on a sample basis in order to
reduce cost.
Time
The smaller size of the sample enables us to collect the data more quickly than a
survey of all the units of the population would, even if we were willing to spend the money.
This matters particularly when the decision is time bound. An accountant may want to know
the total inventory value quickly in order to prepare a periodical report such as a quarterly
balance sheet and a profit and loss account. A detailed study of the inventory is likely to take
too long to allow the report to be prepared in time. Similarly, if we want to measure the
Consumer Price Index for a particular month, we cannot collect data on all consumer prices even
if expenditure is not a hindrance. Collecting and processing data on all consumer items
would in all probability take so long that the price index, when ready, would no longer
serve any meaningful purpose.
Accuracy
It is possible to achieve greater accuracy by using appropriate sampling techniques
than by a complete enumeration of all the units of the population. Contrary to common
belief, complete enumeration may result in inaccurate data owing to the fatigue of
the enumerator or to spurious and unreliable data collected because of the large volume. If, on
the other hand, a small number of items is observed, the basic data will be much more
accurate. It is of course true that drawing a conclusion about a population characteristic,
such as the proportion of defective items, from a sample also introduces error.
However, such errors, known as sampling errors, can be studied and controlled, and probability
statements can be made about their magnitude. The inaccuracy that results from causes such as
the fatigue of the inspector is known as non-sampling error. It is difficult to recognise the
pattern of non-sampling error, and it is not possible to make any statement about its magnitude,
even probabilistically.
Reliability of Inference
In many cases, sampling provides adequate information, so that not much additional
reliability can be gained from complete enumeration in spite of spending large amounts of
additional money and time. With some types of sampling it is also possible to quantify the
magnitude of the possible error, which is not the case with the census approach.
Impossibility of complete enumeration
In many situations the item being studied gets destroyed while being tested.
Sampling is indispensable under such circumstances. If one is interested in computing the
average life of Compact Fluorescent Lamps (CFLs) supplied in a batch, the life of the entire
batch cannot be examined, since this would mean that the entire supply is
wasted. In such cases there is no alternative but to examine the life of a
sample of CFLs and draw an inference about the entire batch.
Infeasibility of complete enumeration
More often than not, a complete enumeration is infeasible because of
practical difficulties. For example, suppose a shaving gel manufacturer wants to launch a new
and improved version of its gel. To get consumer feedback, the manufacturer distributes the
old version of the gel to, say, 500 consumers and after a week or so replaces it with the new
version to get feedback on various attributes of the product. In this situation it would be
infeasible to collect information from all the consumers of shaving gel in India. Some
consumers would have moved from one place to another during the period of study, some
would have stopped using shaving gel just before the period of study, and
others would have been users during the period of study but would
have stopped some time later. In such situations, although a complete enumeration is
theoretically possible, it is practically infeasible.
The above account clearly establishes that a research study gives more reliable
results at a greater convenience by way of sampling as compared with the study of the entire
population.
1.3 Sampling methods
Good research is only as good as the design, methods, and statistics used. Yet the design,
methods, and statistics are useless if we do not first use an optimal sample. A sampling frame
is a list of all the units of the population. A sampling frame should always be up to date and
free from errors of omission and duplication of sampling units.
A perfect frame identifies each element once and only once. Perfect frames are seldom
available in real life. Nevertheless, it needs to be ensured that the sampling frame is
complete, accurate, adequate and up to date.
Further, depending on the requirements of the research, sampling
methods are broadly categorized into two groups, viz. probability sampling methods and
non-probability sampling methods, as depicted in Figure 3.
1.3.1 Probability Sampling Methods
In probability sampling methods the population from which the sample is drawn should be
known to the researcher. Under this sampling design every item of the population has a
known, non-zero chance of inclusion in the sample; in simple random sampling that chance is
equal for every item. A lottery, or drawing a student's name blindfolded from a box containing
the names of all the students, is the classic example of random sampling. Because selection is
left to chance, the method is unbiased and well suited to obtaining a representative sample.
The major disadvantage is that the technique needs the complete sampling frame, i.e. the list
of all the items of the population, which is not always available.
The probability sampling methods are of four types, viz. Simple Random Sampling, Systematic
Sampling, Stratified Sampling and Cluster Sampling.
Figure 3: Sampling Methods (Probability: Simple Random, Systematic, Stratified, Cluster;
Non-probability: Convenience, Purposive, the latter comprising Judgment and Quota)
1.3.1.1 Simple Random Sampling
Simple random sampling is based on the concept of probability. The use of probability in
sampling theory makes it a reliable tool to draw inference or conclusion about the
population. Although the types of conclusion or inference can be quite diverse, two
particular types of decision making are quite prevalent in problems of business and
government.
On various occasions, the management would like to know the percentage or proportion of
units in the population with a certain characteristic. An organisation selling consumer
products may like to know the proportion of potential consumers using a certain type of
cosmetic. The government may like to know the percentage of small farmers owning some
cultivable land in a rural region. A manufacturer planning to export some product may be
interested in ascertaining the proportion of defect-free units his system is capable of
manufacturing.
The representative character of a sample is ensured by allocating some probability to each
unit of the population for being included in the sample. The simple random sample assigns
equal probability to each unit of the population. The simple random sample can be chosen
both with and without replacement.
Simple Random Sampling with Replacement
Suppose the population consists of N units and we want to select a sample of size n. In simple
random sampling with replacement we choose an observation from the population in such
a manner that every unit of the population has an equal chance of 1/N of being selected.
After the first unit is selected, its value is recorded and the unit is placed back in the
population. The second unit is drawn in exactly the same manner as the first. This
procedure is continued until the nth unit of the sample is selected. In this case each
unit of the population has an equal chance of 1/N of being selected at each of the n draws.
Simple Random Sampling without Replacement
In this case, when the first unit is chosen, every unit of the population has a chance of 1/N
of being selected. After the first unit is chosen it is not replaced in the
population. The second unit is selected from the remaining N-1 members of the population,
so that each remaining unit has a chance of $\frac{1}{N-1}$ of being selected. The procedure is
continued until the nth unit of the sample is chosen, with probability $\frac{1}{N-n+1}$.
Random numbers for simple random sampling are generated using a probabilistic mechanism.
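The two procedures above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the frame of 100 serial numbers, the sample size of 10, the seed and the function names are all hypothetical, and Python's pseudo-random generator stands in for the "probabilistic mechanism" mentioned in the text.

```python
import random


def srs_with_replacement(population, n, rng):
    # Every draw picks any of the N units with probability 1/N,
    # so the same unit may appear more than once in the sample.
    return [rng.choice(population) for _ in range(n)]


def srs_without_replacement(population, n, rng):
    # A drawn unit is not returned, so the k-th draw chooses among the
    # N - k + 1 remaining units, each with probability 1/(N - k + 1).
    remaining = list(population)
    sample = []
    for _ in range(n):
        index = rng.randrange(len(remaining))
        sample.append(remaining.pop(index))
    return sample


rng = random.Random(42)          # fixed seed, only to make the illustration repeatable
frame = list(range(1, 101))      # hypothetical frame: serial numbers 1..100 (N = 100)
with_repl = srs_with_replacement(frame, 10, rng)
without_repl = srs_without_replacement(frame, 10, rng)
print(with_repl)
print(without_repl)
```

In the second list no serial number can repeat, whereas in the first list repeats are possible.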
1.3.1.2 Systematic Sampling
Systematic sampling involves selecting items using a constant interval between the selections,
determined by the sampling ratio, with the first interval having a random start. For example,
if a sample of size 10 from a population of size 100 is required, the sampling ratio would be
n/N = 10/100 = 1/10. It would therefore have to be decided where to start from among the first
10 names in our sampling frame. If this number happens to be 5, for example, then the
sample would contain members having serial numbers 5, 15, 25, 35, ..., 95 in the frame. It
is noteworthy that the random process establishes only the first member of the sample;
the rest are pre-determined because of the known sampling ratio. Usually the starting serial
number of the sample is decided by allowing chance to play its role, using a table of random
numbers. In other words, the sampling starts by selecting an element at random from the first
k elements of the list, and then every kth element in the frame is selected, where k is the
sampling interval (sometimes known as the skip). It is calculated as $k = \frac{N}{n}$, where
n is the sample size and N is the population size.
Figure 4: Systematic Sampling
Systematic sampling is much easier to implement than simple random
sampling. However, there is one possibility that should be guarded against while using
systematic sampling: a strong bias in the results if there is any periodicity
in the frame that parallels the sampling ratio. For example, if someone were studying
the demand for various banking transactions in a bank branch on days selected by
systematic sampling, and the chosen sampling ratio were
1/7 or 1/14, he would always be studying the demand on the same day of the week, and
the inferences could be biased depending on whether the day selected is a Monday or a
Friday and so on.
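A minimal sketch of the selection rule and of the periodicity problem follows. It assumes that N is an exact multiple of n, uses 0-based list positions, and invents a 20-week list of bank working days purely for illustration; none of these details come from the text.

```python
import random


def systematic_sample(frame, n, rng):
    # Sampling interval k = N / n (assumed here to divide evenly).
    k = len(frame) // n
    # Random start among the first k units (0-based position).
    start = rng.randrange(k)
    # Take every k-th unit from the random start onwards.
    return frame[start::k][:n]


rng = random.Random(7)                      # fixed seed for repeatability
frame = list(range(1, 101))                 # serial numbers 1..100
sample = systematic_sample(frame, 10, rng)  # sampling ratio 1/10
print(sample)

# Periodicity: a frame of consecutive days sampled with ratio 1/7
# always lands on the same day of the week.
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
daily_frame = [days[i % 7] for i in range(140)]   # hypothetical 20 weeks
weekday_sample = systematic_sample(daily_frame, 20, rng)
print(set(weekday_sample))
```

The second printout contains a single weekday, which is exactly the bias the paragraph above warns about.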
If the frame is arranged in ascending or descending order of some attribute, then the
location of the first sample element may affect the result of the study. For example, suppose
the frame contains a list of students arranged in descending order of their percentage in the
previous examination and we are picking a systematic sample with a sampling ratio of 1/50.
If the first number picked is 1 or 2, then the sample chosen will be academically much better
off than another systematic sample with the first number chosen as 49 or 50. In
such situations, one should devise ways of nullifying the bias due to the starting number,
by insisting on multiple starts after a small cycle or by other such means.
On the other hand, if the frame is so arranged that similar elements are grouped together,
then systematic sampling produces almost a proportional stratified sample and would be,
therefore, more statistically efficient than simple random sampling.
Systematic sampling is perhaps the most commonly used method among the probability
sampling designs and for many purposes e.g. for estimating the precision of the results,
systematic samples are treated as simple random samples.
1.3.1.3 Stratified Sampling
Simple random sampling may not always provide a representative snapshot of the
population. Certain segments of a population can easily be under-represented when an
unrestricted random sample is chosen. Hence, when considerable heterogeneity is present in the
population with regard to the subject matter under
study, it is often a good idea to divide the population into segments or strata and select a
certain number of sampling units from each stratum, thus ensuring representation from all
relevant segments. Thus, for designing a suitable marketing strategy for a consumer
durable, the population of consumers may be divided into strata by income level and a
certain number of consumers can be selected randomly from each stratum.
Figure 5: Stratified Sampling (samples of sizes n1, n2, ..., np are drawn from Stratum-1,
Stratum-2, ..., Stratum-p of sizes N1, N2, ..., Np respectively)
Therefore, in stratified random sampling the population is first divided into different
homogeneous groups or strata, which may be based upon a single criterion such as male or
female, or upon a combination of criteria such as sex, caste, level of education and so on.
This method is generally applied when different categories of individuals constitute the
population, viz. General, OBC, SC, ST; or upper income, middle income, lower income; or small
farmers, big farmers, marginal farmers, landless farmers etc. To get an accurate picture of the
standard of living of a particular population in India, for instance, it is advisable to
categorize the population on the basis of caste, religion or land holding; otherwise some
sections may be under-represented or not represented at all.
Stratified random sampling may be either Proportionate Stratified Random Sampling or
Disproportionate Stratified Random Sampling.
Proportionate Stratified Random Sampling
In proportionate stratified random sampling, the researcher stratifies the population
according to known characteristics and then randomly draws a sample from each stratum
in proportion to that stratum's share of the population. That is,
the population is divided into several sub-populations depending upon some known
characteristics; these sub-populations are called strata, and each is homogeneous. For example,
suppose a town area committee has 15000 voters, of whom 60% are Hindus, 30% are
Muslims and 10% are others, and the researcher wants to draw a sample of 300 voters from
the population in proportion to these shares. This can be done by multiplying the sample size
by each proportion: the sample of Hindu voters will be 300 x 60% =
180, of Muslim voters 300 x 30% = 90, and of others 300 x 10% = 30. The researcher therefore
has to collect the complete voter list of the town and randomly select the sample from each
category as calculated above. With this method the sampling error is reduced and the sample
reflects the composition of the population on the stratifying characteristic.
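The allocation arithmetic in the voter example can be reproduced with a short Python sketch. The function name and the use of percentage shares as input are assumptions made for illustration; with other shares the rounded stratum sizes need not add up exactly to the intended sample size.

```python
def proportionate_allocation(sample_size, shares_percent):
    # Each stratum receives sample_size * (its percentage share of the population).
    # Rounding is needed in general; in this example every product is a whole number.
    return {
        stratum: round(sample_size * percent / 100)
        for stratum, percent in shares_percent.items()
    }


shares = {"Hindu": 60, "Muslim": 30, "Others": 10}   # shares from the example above
allocation = proportionate_allocation(300, shares)
print(allocation)   # {'Hindu': 180, 'Muslim': 90, 'Others': 30}
```

Once the stratum sample sizes are fixed, the units within each stratum would still be drawn at random from that stratum's part of the voter list.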
Disproportionate Stratified Random Sampling
In this method the number of sampling units drawn from each stratum is not necessarily in
proportion to the stratum's share of the population. Suppose that, for the same town, the
researcher wants to know the voting pattern of male and female Hindu, Muslim and other voters;
in that case he must take an equal number of male and female voters from each category. Here
the investigator gives equal weightage to each stratum. The sample is biased in the sense that
some strata are over-represented and others under-represented, so it is not truly
representative of the population; nevertheless the method is useful in some special cases.
If the different strata in the population have unequal variances of the characteristic being
measured, then the sample size allocation decision should consider the variance as well. It
would be logical to have a smaller sample from a stratum where the variance is smaller than
from another stratum where the variance is higher. In fact, if
$\sigma_1^2, \sigma_2^2, \ldots, \sigma_p^2$ are the variances
of the p strata, then the statistical efficiency is the highest when
$$\frac{n_1}{N_1 \sigma_1} = \frac{n_2}{N_2 \sigma_2} = \cdots = \frac{n_p}{N_p \sigma_p}$$
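The condition above says that each stratum's sample size should be proportional to the product of its size and its standard deviation. A minimal sketch of that allocation follows; the stratum sizes, standard deviations and total sample size are hypothetical numbers chosen only for illustration, and the results are left unrounded so that the equal-ratio property can be checked directly.

```python
def variance_based_allocation(total_n, sizes, std_devs):
    # n_i is made proportional to N_i * sigma_i, so that
    # n_i / (N_i * sigma_i) is the same constant for every stratum.
    weights = [size * sd for size, sd in zip(sizes, std_devs)]
    total_weight = sum(weights)
    return [total_n * w / total_weight for w in weights]


sizes = [5000, 3000, 2000]       # hypothetical stratum sizes N_1, N_2, N_3
std_devs = [10.0, 20.0, 40.0]    # hypothetical standard deviations sigma_1..sigma_3
alloc = variance_based_allocation(300, sizes, std_devs)
ratios = [n / (N * s) for n, N, s in zip(alloc, sizes, std_devs)]
print(alloc)
print(ratios)
```

Here the smallest stratum receives the largest sample because its variability is the highest, which matches the reasoning in the paragraph above. In practice the allocations would be rounded to whole numbers.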
1.3.1.4 Cluster Sampling
This is another type of probability sampling method, in which the sampling units are not individual
elements of the population; instead, groups of elements or individuals are selected as the sample. In
cluster sampling the total population is divided into a number of relatively small sub-divisions or
groups, called clusters, and then some of these clusters are randomly selected for
inclusion in the sample. Suppose a researcher wants to study the functioning of the mid-day meal
service in a district; he can then use the schools clustered in a block or two instead of selecting
schools scattered all over the district. Cluster sampling reduces the researcher's cost and labour of
collecting the data, but it is less precise than simple random sampling.
We can now compare Cluster Sampling with Stratified Sampling. Stratification is done to make the
strata homogeneous within and different from other strata. Clusters, on the other hand, should be
heterogeneous within, and the different clusters should be similar to each other. A cluster, ideally,
is a mini-population and has all the features of the population.
The criterion used for stratification is a variable which is closely associated with the characteristic
we are measuring, e.g. income level when we are measuring the family consumption of non-aerated
beverages. On the other hand, convenience of data collection is usually the basis for cluster
definitions. Geographic contiguity is quite often used for cluster definitions, and in such cases
cluster sampling is also known as Area Sampling.
In stratified sampling there are usually only a few strata, and a random sample must be picked from
each of the strata for drawing inferences. In cluster sampling there are many clusters, out of which
only a few are picked by random sampling, and the selected clusters are then completely enumerated.
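The procedure just described (pick a few clusters at random, then enumerate every unit inside them) can be sketched as follows. The blocks, the schools and the number of clusters chosen are hypothetical, loosely modelled on the mid-day meal example above.

```python
import random


def cluster_sample(clusters, num_clusters, rng):
    # Step 1: choose whole clusters at random (the cluster is the sampling unit).
    chosen = rng.sample(sorted(clusters), num_clusters)
    # Step 2: completely enumerate every unit inside the chosen clusters.
    units = [unit for name in chosen for unit in clusters[name]]
    return chosen, units


rng = random.Random(3)   # fixed seed for repeatability
# Hypothetical district: 8 blocks (clusters), each containing 5 schools.
clusters = {
    f"block_{b}": [f"school_{b}_{s}" for s in range(1, 6)]
    for b in range(1, 9)
}
chosen_blocks, schools = cluster_sample(clusters, 2, rng)
print(chosen_blocks)
print(schools)
```

By contrast, a stratified design would visit every block and draw a random sample of schools within each one.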
Multi-stage and Multi-phase Sampling
In multi-stage sampling the sample is drawn in more than one stage. This is used in most large
surveys, where the sampling unit is something larger than an individual element of the population in
all stages but the final. For example, in a national survey on the demand for fertilizers one might
use stratified sampling in the first stage, with a district as the sampling unit and the average
rainfall in the district as the criterion for stratification. Having obtained 20 districts from this
stage, cluster sampling may be used in the second stage to pick 10 villages in each of the selected
districts. Finally, in the third stage, stratified sampling may be used in each village to pick farms
in each of the strata defined with land holding as the criterion.
Multi-phase sampling, on the other hand, is designed to make use of the information collected in
one phase to develop a sampling design in a subsequent phase. A study with two phases is often
called Double Sampling. The first phase of the study might reveal a relationship between the family
consumption of non-aerated beverages and the family income and this information would then be
used in the second phase to stratify the population with family income as the criterion.
1.3.2 Non-Probability Sampling Methods
Probability sampling has some theoretical advantages over non-probability sampling. The bias
introduced by the selection of the sample can be eliminated, and it is possible to set a confidence
interval for the population parameter that is being studied. In spite of these advantages of
probability sampling, non-probability sampling is used quite frequently in many sample surveys.
The reasons are mostly practical.
Probability sampling requires a list of all the sampling units, and this frame is not available in
many situations, nor is it practically feasible to develop a frame of, say, all the households in a
city or in a zone or ward of a city. Sometimes the objective of the study may not be to draw a
statistical inference about the population but to get familiar with extreme cases, or some other such
objective. In a dealer survey, our objective may be to get familiar with the problems faced by our
dealers so that we can take corrective action wherever possible. Probability sampling is rigorous,
and this rigour, e.g. in selecting samples, adds to the cost of the study. And finally, even when we
are doing probability sampling, there are chances of deviation from the laid-out process, especially
where some samples are selected by the interviewers on site, say after reaching a village. Also, some
of the sample members may not agree to be interviewed or may not be available, and our sample may
then not be a probability sample in the strictest sense of the term.
1.3.2.1 Convenience Sampling
In this type of non-probability sampling, the choice of the sample is left completely to the
convenience of the researcher. The cost involved in picking the sample is minimal and the cost
of data collection is also generally low, e.g. the researcher can go to some retail shops and interview
some shoppers while studying the demand for some commodity.
Another form of convenience sampling is known as ‘Snowball Sampling’. This is a sociometric
sampling technique generally used to study small groups. The persons in a group identify their
friends, who in turn identify their own friends and colleagues, until the informal relationships
converge into some definite social pattern. It is like a snowball increasing in size as it rolls down
an ice-field. For example, in research on drug addiction it is difficult to find out who the drug
users are, but once one person is identified he can give the names of his associates, and each of
them can in turn name another 2 or 3 persons whom he knows to use drugs. In this way the required
number of persons is identified and the data are collected. This method is suitable for studies of
diffusion of innovation, network analysis and decision making.
However, such samples can suffer from excessive bias from known or unknown sources and also
there is no way that the possible errors can be quantified.
1.3.2.2 Purposive Sampling
In convenience sampling, any member of the population can be included in the sample without any
restriction. When some restrictions are put on the possible inclusion of a member in the sample, the
sampling is called purposive. This is a non-random sampling method in which the researcher selects,
at his own discretion, the units he considers important for the research and believes to be typical
and representative of the population. Say a researcher wants to forecast the chance of a political
party coming to power in a general election. He may select some reporters, some teachers and some
elite people of the territory and collect their opinions for the purpose of his study. He considers
them to be leading persons whose views are relevant to the party's chance of coming to power.
Because the selection is purposive, the method can involve large errors and lead to misleading
conclusions.
The purposive sampling is broadly of two types, viz. Judgment Sampling and Quota Sampling.
1.3.2.2.1 Judgment Sampling
In judgment sampling, the judgment or opinion of some experts forms the basis for sample selection.
The experts are persons who are believed to have information on the population which can help in
giving us better samples. Such sampling is very useful when we want to study rare events, or when
members have extreme positions, or even when the objective of the study is to collect a wide cross-
section of views from one extreme to the other.
1.3.2.2.2 Quota Sampling
Even while using non-probability sampling, one might want the sample to be representative of the
population in some defined ways. Quota sampling seeks to achieve this, so that the bias
introduced by sampling is reduced.
If in a given population, 25% of the members belong to the high income group, 25% to the middle
income group, 35% to the low income group and 15 % are Below Poverty Line (BPL) and we are using
quota sampling, we would specify that the sample should also contain members in the same
proportion as in the population e.g. 15% of the sample members would belong to the BPL group
and so on.
The criteria used to set quotas could be many. For example, family size could be another criterion,
and we can set quotas for families with a family size of up to 3, of 4 or 5, and above 5. However, if
the number of such criteria is large, it becomes difficult to locate sample members satisfying the
combination of the criteria. In such cases, the overall relative frequency of each criterion in the
sample is matched with the overall relative frequency of the criterion in the population.
This method of sampling is almost the same as stratified random sampling as described above; the
only difference is that the elements are not selected by randomization but simply taken until each
quota is filled. As quota sampling is not random, the method is biased and can lead to
large errors.
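The contrast with stratified random sampling can be made concrete with a small sketch. It assumes a total sample of 20, which turns the income shares of the example above (25%, 25%, 35%, 15%) into quotas of 5, 5, 7 and 3, and it invents a stream of respondents who are simply taken in the order they are met; the function name and the respondent list are hypothetical.

```python
def quota_sample(respondents, quotas):
    # respondents: (respondent_id, group) pairs in the order they are met.
    # No randomization: each person is accepted if the quota for
    # his or her group is still open, and skipped otherwise.
    remaining = dict(quotas)
    chosen = []
    for respondent_id, group in respondents:
        if remaining.get(group, 0) > 0:
            chosen.append((respondent_id, group))
            remaining[group] -= 1
        if not any(remaining.values()):
            break   # every quota has been filled
    return chosen


quotas = {"high": 5, "middle": 5, "low": 7, "BPL": 3}    # 25%, 25%, 35%, 15% of 20
groups = ["high", "middle", "low", "BPL"]
respondents = [(i, groups[i % 4]) for i in range(200)]   # hypothetical arrival order
chosen = quota_sample(respondents, quotas)
counts = {g: sum(1 for _, grp in chosen if grp == g) for g in quotas}
print(counts)
```

The group counts match the population proportions, but which individuals fill each quota depends entirely on who happens to be met first, which is the source of the bias noted above.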
2. The Sampling Distribution
Sample statistics form the basis of all inferences drawn about populations. If we know the probability
distribution of the sample statistic, then we can calculate the probability that the sample statistic
assumes a particular value (if it is a discrete random variable) or has a value in a given interval. This
ability to calculate the probability that the sample statistic lies in a particular interval is the most
important factor in all statistical inferences. Let’s demonstrate this by an example.
Suppose we know that 55% of the population of all users of Shampoo prefer brand ‘A’ to the next
competing brand. A “new improved” version of ‘A’ has been developed and given to a random
sample of 200 shampoo users for use. If 120 of these prefer the “new improved” version to the next
competing brand, what should one conclude? For an answer, we would like to know the probability
that the sample proportion in a sample of size 200 is as large as 60% or higher when the true
population proportion is only 55%, i.e. assuming that the new version is no better than the old. If
this probability is quite large, say 0.5, we might conclude that the high sample proportion viz. 60% is
perhaps because of sampling errors and the new version is not really superior to the old. On the
other hand, if this probability works out to a very small figure, say 0.001, then we might conclude
that the true population proportion is higher than 55%, i.e. the new version is actually superior to
the old one as perceived by members of the population. To calculate this probability, we need to
know the probability distribution of sample proportion or the sampling distribution of the
proportion.
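The probability asked for in the shampoo example can be sketched numerically. The snippet below is only an illustration, assuming Python's standard library; the function names are hypothetical. It computes the chance of seeing 120 or more "prefer" answers out of 200 when the true proportion is 0.55, once exactly from the binomial distribution and once with a normal approximation.

```python
from math import comb, sqrt
from statistics import NormalDist

def binomial_upper_tail(n, k, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def normal_upper_tail(n, p_hat, p):
    """Normal approximation to P(sample proportion >= p_hat)."""
    se = sqrt(p * (1 - p) / n)          # standard error of the proportion
    return 1 - NormalDist().cdf((p_hat - p) / se)

exact = binomial_upper_tail(200, 120, 0.55)
approx = normal_upper_tail(200, 0.60, 0.55)
print(f"exact binomial tail : {exact:.4f}")
print(f"normal approximation: {approx:.4f}")
```

The normal approximation comes out near 0.08, which is neither "quite large" nor "very small" in the sense used above, so a sample result of 60% would be only moderate evidence that the new version is better.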
The sampling distribution, thus, is a distribution of a sample statistic. It is a model of a distribution
of scores, like the population distribution, except that the scores are not raw scores, but statistics. It
is a thought experiment; "what would the world be like if a person repeatedly took samples of size
N from the population distribution and computed a particular statistic each time?" The resulting
distribution of statistics is called the sampling distribution of that statistic.
For example, suppose that a sample of size sixteen (N=16) is taken from some population. The mean
of the sixteen numbers is computed. Next a new sample of sixteen is taken, and the mean is again
computed. If this process were repeated an infinite number of times, the distribution of the now
infinite number of sample means would be called the sampling distribution of the mean. Similarly,
every statistic has a sampling distribution.
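The thought experiment can be imitated on a computer by stopping after a large but finite number of repetitions. The sketch below is an illustration only: the population (100,000 roughly normal scores with mean 50 and standard deviation 10) and the number of repetitions are assumptions made for the example, not values from the text.

```python
import random
import statistics

random.seed(1)

# Hypothetical population: 100,000 scores, roughly normal, mean 50, sd 10.
population = [random.gauss(50, 10) for _ in range(100_000)]

def sample_means(pop, n, repetitions):
    """Draw `repetitions` samples of size n and return the mean of each."""
    return [statistics.fmean(random.sample(pop, n)) for _ in range(repetitions)]

# N = 16 as in the example; 5,000 repetitions stand in for "infinitely many".
means = sample_means(population, n=16, repetitions=5_000)
print("population mean        :", round(statistics.fmean(population), 2))
print("mean of sample means   :", round(statistics.fmean(means), 2))
print("std dev of sample means:", round(statistics.stdev(means), 2))
```

The list `means` is an empirical stand-in for the sampling distribution of the mean; replacing `statistics.fmean` inside `sample_means` with another statistic gives the corresponding approximation for that statistic.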
Just as the population models can be described with parameters, so can the sampling distribution.
The expected value (analogous to the mean) of a sampling distribution will be represented here by
the symbol µ. The µ symbol is often written with a subscript to indicate which sampling distribution
is being discussed. For example, the expected value of the sampling distribution of the mean is
represented by the symbol $\mu_{\bar{x}}$, that of the median by $\mu_{Md}$, etc. The value of
$\mu_{\bar{x}}$ can be thought of as the mean of the distribution of means. In a similar manner the
value of $\mu_{Md}$ is the mean of a distribution of medians. They are not really means, because it
is not possible to find a mean when $N = \infty$, but are the mathematical equivalent of a mean.
Using advanced mathematics, in a thought experiment, the theoretical statistician often discovers a
relationship between the expected value of a statistic and the model parameters. For example, it
can be proven that, for a symmetric population such as the normal, the expected value of both the
mean and the median, $\bar{X}$ and Md, is equal to $\mu_x$.
When the expected value of a statistic equals a population parameter, the statistic is called an
unbiased estimator of that parameter. In this case, both the mean and the median would be
unbiased estimators of the parameter $\mu_x$.
A sampling distribution may also be described with a parameter corresponding to a variance,
symbolized by $\sigma^2$. The square root of this parameter is given a special name, the standard error.
Each sampling distribution has a standard error. In order to keep them straight, each has a name
tagged on the end of the words "standard error" and a subscript on the σ symbol. The standard
deviation of the sampling distribution of the mean is called the standard error of the mean and is
symbolized by $\sigma_{\bar{x}}$. Similarly, the standard deviation of the sampling distribution of the
median is called the standard error of the median and is symbolized by $\sigma_{Md}$.
In each case the standard error of a statistic describes the degree to which the computed statistics
will differ from one another when calculated from samples of similar size selected from similar
population models. The larger the standard error, the greater the difference between the computed
statistics. A small standard error is a valuable property in the estimation of a population
parameter: the statistic with the smallest standard error is preferred as the estimator of the
corresponding population parameter, everything else being equal. Statisticians have proven that in
most cases the standard error of the mean is smaller than the standard error of the median; for a
normal population and large samples, the standard error of the median is about 1.25 times that of
the mean. Because of this property, the mean is the preferred estimator of $\mu_x$.
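The comparison between the two standard errors can be checked by simulation. The following is a minimal sketch, assuming a standard normal population and samples of size 25; both choices, and the function name, are illustrative assumptions rather than part of the text.

```python
import random
import statistics

random.seed(2)

def standard_errors(n, repetitions=4_000, mu=0.0, sigma=1.0):
    """Estimate the standard errors of the mean and the median by simulation."""
    means, medians = [], []
    for _ in range(repetitions):
        sample = [random.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    # The standard deviation of each list of statistics is its standard error.
    return statistics.stdev(means), statistics.stdev(medians)

se_mean, se_median = standard_errors(n=25)
print(f"standard error of the mean  : {se_mean:.3f}")
print(f"standard error of the median: {se_median:.3f}")
print(f"ratio median / mean         : {se_median / se_mean:.2f}")
```

For a normal population the ratio is expected to settle somewhat above 1, which is the sense in which the mean is the more precise estimator here. For heavy-tailed populations the ordering can reverse, so the result should not be read as universal.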
In practice, we refer to the sampling distributions of only the commonly used sampling statistics like
the sample mean, sample variance, sample proportion, sample median etc., which have a role in
making inferences about the population.
2.1 The Sampling Distribution of the Mean
There are many (infinite!) possible values of the sample mean and the particular value that we
obtain, if we pick up only one sample, is determined only by chance. The distribution of the sample
mean is also referred to as the sampling distribution of the mean.
However, to observe the distribution of $\bar{x}$ empirically, we have to take many samples of size n
and determine the value of $\bar{x}$ for each sample. Then, looking at the various observed values of
$\bar{x}$, it might be possible to get an idea of the nature of the distribution. Such a sampling
distribution of the mean is known as the distribution of sample means. This distribution is described
with the parameters $\mu_{\bar{x}}$ and $\sigma_{\bar{x}}$.
Sampling from Infinite Populations
Let’s study two cases –
1. Where the population is infinitely large or when the sampling is done with replacement
2. Where the population is finite and we are sampling without replacement
For the first scenario let’s assume we have a population which is infinitely large, with a
population mean of $\mu$ and a population variance of $\sigma^2$. This implies that if x is a random
variable denoting the measurement of the characteristic that we are interested in, on one element of
the population picked up randomly, then the expected value of x is $E(x) = \mu$ and the variance of x
is $\mathrm{Var}(x) = \sigma^2$.
The sample mean, $\bar{x}$, can be looked at as the sum of n random variables, viz. $x_1, x_2, \ldots, x_n$,
each being divided by n. Here $x_1$ is a random variable representing the first observed value in the
sample, $x_2$ the second observed value, and so on. Now, when the population is infinitely large,
whatever be the value of $x_1$, the distribution of $x_2$ is not affected by it. This is true of any
other pair of random variables as well. In other words, $x_1, x_2, \ldots, x_n$ are independent random
variables and all are picked up from the same population.
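$$\therefore\; E(x_1) = \mu \quad \text{and} \quad \mathrm{Var}(x_1) = \sigma^2$$
$$E(x_2) = \mu \quad \text{and} \quad \mathrm{Var}(x_2) = \sigma^2$$
… and so on
Finally,
$$E(\bar{x}) = E\left(\frac{x_1 + x_2 + \cdots + x_n}{n}\right)$$
$$= \frac{1}{n}E(x_1) + \frac{1}{n}E(x_2) + \cdots + \frac{1}{n}E(x_n)$$
$$= \frac{1}{n}\mu + \frac{1}{n}\mu + \cdots + \frac{1}{n}\mu$$
$$= \mu$$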
This means that the expected value of the sample mean is the same as the population mean.
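and
$$\mathrm{Var}(\bar{x}) = \mathrm{Var}\left(\frac{x_1 + x_2 + \cdots + x_n}{n}\right)$$
$$= \mathrm{Var}\left(\frac{x_1}{n}\right) + \mathrm{Var}\left(\frac{x_2}{n}\right) + \cdots + \mathrm{Var}\left(\frac{x_n}{n}\right)$$
$$= \frac{1}{n^2}\mathrm{Var}(x_1) + \frac{1}{n^2}\mathrm{Var}(x_2) + \cdots + \frac{1}{n^2}\mathrm{Var}(x_n)$$
$$= \frac{1}{n^2}\sigma^2 + \frac{1}{n^2}\sigma^2 + \cdots + \frac{1}{n^2}\sigma^2$$
$$= \frac{\sigma^2}{n}$$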
This says that the variance of the sample mean is the variance of the population divided by the
sample size.
If we take a large number of samples of size n, then the average value of the sample means tends to
be close to the true population mean. On the other hand, if the sample size is increased then the
variance of $\bar{x}$ gets reduced, and by selecting an appropriately large value of n the variance of
$\bar{x}$ can be made as small as desired.
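Both results can be checked empirically. The sketch below is an illustration under stated assumptions: the population is a fair die (chosen only because its mean 3.5 and variance 35/12 are easy to verify), and sampling is done with replacement so that the draws are independent, as the derivation requires.

```python
import random
import statistics

random.seed(3)

population = [1, 2, 3, 4, 5, 6]            # a fair die, used only as a toy population
mu = statistics.fmean(population)
sigma2 = statistics.pvariance(population)  # population variance (divides by N)

def simulated_mean_and_variance(n, repetitions=20_000):
    """Mean and variance of the sample mean over many samples of size n."""
    means = [statistics.fmean(random.choices(population, k=n))
             for _ in range(repetitions)]
    return statistics.fmean(means), statistics.pvariance(means)

results = {}
for n in (2, 5, 10, 30):
    results[n] = simulated_mean_and_variance(n)
    avg, var = results[n]
    print(f"n={n:2d}  mean of means={avg:.3f}  Var={var:.4f}  sigma^2/n={sigma2 / n:.4f}")
```

In each row the simulated variance should sit close to the last column, and it shrinks as n grows while the mean of the sample means stays near 3.5.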
The standard deviation of $\bar{x}$ is also called the standard error of the mean. Very often we estimate
the population mean by the sample mean. The standard error of the mean indicates the extent to
which the observed value of sample mean can be away from the true value, due to sampling errors.
For example, if the standard error of the mean is small, we are reasonably confident that
whatever sample mean value we have observed cannot be very far away from the true value. The
standard error of the mean is represented by $\sigma_{\bar{x}}$.
Sampling with replacement
The above results have been obtained under the assumption that the random variables $x_1, x_2, \ldots, x_n$
are independent. This assumption is valid when the population is infinitely large. It is also valid when
the sampling is done with replacement, so that the population is back to the same form before the
next sample member is picked up.
Hence, if the sampling is done with replacement, we would again have
$$E(\bar{x}) = \mu \quad \text{and} \quad \mathrm{Var}(\bar{x}) = \frac{\sigma^2}{n}$$
meaning thereby that
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
Sampling Without Replacement from Finite Populations
When a sample is picked up without replacement from a finite population, the probability
distribution of the second random variable depends on what has been the outcome of the first pick
and so on. As the n random variables representing the n sample members do not remain
independent, the expression for the variance of $\bar{x}$ changes. The results of the derivation for
this situation work out as under:
$$E(\bar{x}) = \mu \quad \text{and} \quad \mathrm{Var}(\bar{x}) = \sigma_{\bar{x}}^2 = \frac{\sigma^2}{n} \cdot \frac{N-n}{N-1}$$
meaning thereby that
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \cdot \sqrt{\frac{N-n}{N-1}}$$
By comparing these expressions with the ones derived above we find that the standard error of
$\bar{x}$ is the same but further multiplied by a factor $\sqrt{(N-n)/(N-1)}$. This factor is,
therefore, known as the finite population multiplier.
In practice, almost all the samples used are picked up without replacement. Also, most populations
are finite although they may be very large, and so the standard error of the mean should theoretically
be found by using the expression given above. However, if the population size (N) is large and
consequently the sampling ratio (n/N) small, then the finite population multiplier is close to 1 and is
not used, thus treating large finite populations as if they were infinitely large. For example, if
N = 500,000 and n = 500, the finite population multiplier is
$$\sqrt{\frac{N-n}{N-1}} = \sqrt{\frac{500000-500}{500000-1}} = \sqrt{\frac{499500}{499999}} = \sqrt{0.999002} = 0.9995$$
which is very close to 1, and the standard error of the mean would, for all practical purposes, be
the same whether the population is treated as finite or infinite. As a rule of thumb, the finite
population multiplier may be omitted if the sampling ratio (n/N) is smaller than 0.05.
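The multiplier is simple to compute directly. The helper functions below are a hypothetical sketch (their names are not from the text); they reproduce the N = 500,000, n = 500 example and contrast it with a case where the sampling ratio is well above 0.05.

```python
from math import sqrt

def finite_population_multiplier(N, n):
    """sqrt((N - n) / (N - 1)) for a sample of n drawn without replacement from N."""
    return sqrt((N - n) / (N - 1))

def standard_error_of_mean(sigma, n, N=None):
    """sigma / sqrt(n), multiplied by the finite population factor when N is given."""
    se = sigma / sqrt(n)
    if N is not None:
        se *= finite_population_multiplier(N, n)
    return se

print(round(finite_population_multiplier(500_000, 500), 4))  # the example in the text: 0.9995
print(round(finite_population_multiplier(1_000, 200), 4))    # sampling ratio 0.20
```

With a sampling ratio of 0.20 the multiplier drops noticeably below 1, so ignoring it would overstate the standard error; with a ratio of 0.001 it makes no practical difference.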
Sampling from Normal Populations
It has been observed that the normal distribution occurs very frequently among many natural
phenomena. For example, heights or weights of individuals, the weights of filled-bags from an
automatic machine, the hardness obtained by heat treatment, etc. are distributed normally.
It is also a known fact that the sum of two independent random variables will follow a normal
distribution if each of the two random variables belongs to a normal population. The sample mean,
as we have seen earlier, is the sum of n random variables $x_1, x_2, \ldots, x_n$, each divided by n.
Now, if each of these random variables is from the same normal population, it is not difficult to
see that $\bar{x}$ would also be distributed normally.
Let $x \sim N(\mu, \sigma^2)$ symbolically represent the fact that the random variable x is distributed
normally with mean $\mu$ and variance $\sigma^2$. Thus,
if $x \sim N(\mu, \sigma^2)$ then it follows that $\bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$.
The normal distribution is a continuous distribution and so the population cannot be small and finite
if it is distributed normally; that is why the finite population multiplier is not used in the above
expression. Let’s see, by an example, how to make use of the above result.
Suppose the weight of candy produced on a semi-automatic machine is known to be distributed
normally with a mean of 10 mg and a standard deviation of 0.1 mg. If we pick up a random sample
of size 5, what is the probability that the sample mean will be between 9.95 mg and 10.05 mg?
Let x be a random variable representing the weight of one candy picked up at random.
We know that $x \sim N(10, 0.01)$.
Therefore, it follows that $\bar{x} \sim N\left(10, \frac{0.01}{5}\right)$.
This denotes that $\bar{x}$ will be distributed normally with a mean of 10 and a variance which is
only 1/5 of the variance of the population, since the sample size is 5.
$$\Pr\{9.95 \le \bar{x} \le 10.05\} = 2 \times \Pr\{10 \le \bar{x} \le 10.05\}$$
$$= 2 \times \Pr\left\{\frac{10-\mu}{\sigma/\sqrt{n}} \le \frac{\bar{x}-\mu}{\sigma/\sqrt{n}} \le \frac{10.05-\mu}{\sigma/\sqrt{n}}\right\}$$
$$= 2 \times \Pr\left\{0 \le z \le \frac{10.05-10}{0.1/\sqrt{5}}\right\}$$
$$= 2 \times \Pr\{0 \le z \le 1.12\}$$
$$= 2 \times 0.3686$$
$$= 0.7372$$
Figure 6: Distribution of $\bar{x}$; the enclosed area represents the probability that the random variable $\bar{x}$ lies between 9.95 and 10.05
We first make use of the symmetry of the normal distribution and then calculate the z value by
subtracting the mean and then dividing it by the standard deviation of the random variable
distributed normally, viz. $\bar{x}$. The probability of interest is also shown as the enclosed area
in Figure 6 above.
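The candy calculation can be reproduced without a z table. The snippet below is a small sketch using `statistics.NormalDist` from the Python standard library; the variable names are chosen for the example.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 10.0, 0.1, 5                 # population mean, sd and sample size from the example
x_bar = NormalDist(mu, sigma / sqrt(n))     # sampling distribution of the sample mean

z = (10.05 - mu) / (sigma / sqrt(n))        # z value of the upper limit
prob = x_bar.cdf(10.05) - x_bar.cdf(9.95)   # area between the two limits

print(f"z = {z:.3f}")
print(f"Pr(9.95 <= x_bar <= 10.05) = {prob:.4f}")
```

The result is about 0.736, marginally below the 0.7372 obtained above, because the table lookup rounds z up to 1.12 whereas the code keeps the unrounded value of roughly 1.118.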
2.2 The Central Limit Theorem
The above parameters are closely related to the parameters of the population distribution, with the
relationship being described by the Central Limit Theorem. The Central Limit Theorem essentially
states that the mean of the sampling distribution of the mean ($\mu_{\bar{x}}$) equals the mean of the
population ($\mu_x$), that the standard error of the mean ($\sigma_{\bar{x}}$) equals the standard
deviation of the population ($\sigma_x$) divided by the square root of N, and that the sampling
distribution of the mean approaches a normal distribution as the sample size gets infinitely large
($N \to \infty$). These relationships may be summarized as follows:
$$\mu_{\bar{x}} = \mu_x \quad \text{and} \quad \sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{N}}$$
The two relationships for the mean and the standard error hold for any sample size, as derived in
the previous section. It is the normal shape that, in theory, is reached only when the sample size
is very large (∞); in practice, an infinite sample size is impossible.
In most situations encountered by researchers, the Central Limit Theorem works reasonably well
with an N greater than 10 or 20, although strongly skewed populations may need larger samples.
Thus, it is possible to closely approximate what the distribution of sample means looks like, even
with relatively small sample sizes.
The importance of the Central Limit Theorem to statistical thinking cannot be overstated. Most of
hypothesis testing and sampling theory are based on this theorem. In addition, it provides a
justification for using the normal curve as a model for many naturally occurring phenomena. If a
trait, such as intelligence, can be thought of as a combination of relatively independent events, in
this case both genetic and environmental, then it would be expected that trait would be normally
distributed in a population.
We need to use the central limit theorem when the population distribution is either unknown or
known to be non-normal. If the population distribution is known to be normal, then $\bar{x}$ will
also be distributed normally, irrespective of the sample size.
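How quickly the normal shape appears for a non-normal population can be explored by simulation. The sketch below assumes an exponential population with mean 1 and standard deviation 1 (a deliberately skewed choice made for this example) and tracks the skewness of the sample means, which is 0 for a normal distribution. The function names are hypothetical.

```python
import random
import statistics

random.seed(7)

def skewness(values):
    """Third standardized moment; 0 for a symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

def means_of_exponential_samples(n, repetitions=10_000):
    """Means of samples of size n from an exponential population (mean 1, sd 1)."""
    return [statistics.fmean([random.expovariate(1.0) for _ in range(n)])
            for _ in range(repetitions)]

report = {}
for n in (1, 5, 20, 100):
    means = means_of_exponential_samples(n)
    report[n] = (statistics.pstdev(means), skewness(means))
    sd, skew = report[n]
    print(f"N={n:3d}  sd of means={sd:.3f}  1/sqrt(N)={n ** -0.5:.3f}  skewness={skew:.2f}")
```

The standard deviation of the means follows $\sigma_x/\sqrt{N}$ at every sample size, while the skewness fades only gradually, which illustrates why a strongly skewed population may need a larger N before the normal approximation is comfortable.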
2.3 The Sampling Distribution of the Variance
Before attempting to discuss the sampling distribution of the variance, it is worthwhile to first
introduce the concept of sample variance and then present the chi-square distribution which helps
us in working out probabilities for the sample variance, when the population is distributed normally.
The Sample Variance
In probability theory and statistics, the variance is a measure of how far a set of numbers is spread
out. A variance of zero indicates that all the values are identical. A non-zero variance is always
positive: a small variance indicates that the data points tend to be very close to the mean (expected
value) and hence to each other, while a high variance indicates that the data points are very spread
out from the mean and from each other.
We use the sample mean to estimate the population mean, when that parameter is unknown.
Similarly, we use a sample statistic called the sample variance to estimate the population variance.
The sample variance is usually denoted by $s^2$ and it again captures some kind of an average of the
square of deviations of the sample values from the sample mean. Let us put it in an equation form:
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$$
By comparing this expression with the corresponding expression for the population variance, we
notice two differences. The deviations are measured from the sample mean and not from the
population mean and secondly, the sum of squared deviations is divided by (n - 1) and not by n.
Consequently, we can calculate the sample variance based only on the sample values without
knowing the value of any population parameter. The division by (n - 1) is due to a technical reason:
it makes the expected value of $s^2$ equal to $\sigma^2$, which it is supposed to estimate.
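The effect of dividing by (n - 1) rather than n can be seen in a short simulation. The sketch below assumes a normal population with $\sigma^2 = 4$ and samples of size 5; these values are illustrative choices, not taken from the text.

```python
import random
import statistics

random.seed(8)

mu, sigma, n = 0.0, 2.0, 5        # hypothetical normal population, sigma^2 = 4
repetitions = 20_000

divide_by_n_minus_1 = []
divide_by_n = []
for _ in range(repetitions):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    divide_by_n_minus_1.append(statistics.variance(sample))   # divisor n - 1
    divide_by_n.append(statistics.pvariance(sample))          # divisor n

avg_unbiased = statistics.fmean(divide_by_n_minus_1)
avg_biased = statistics.fmean(divide_by_n)
print(f"average with divisor n - 1: {avg_unbiased:.3f}  (target sigma^2 = {sigma**2})")
print(f"average with divisor n    : {avg_biased:.3f}  (expected (n-1)/n * sigma^2 = {(n - 1) / n * sigma**2})")
```

On average the (n - 1) version lands near 4, while the version that divides by n settles near 3.2, i.e. it underestimates the population variance by the factor (n - 1)/n.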
2.4 The Chi-square Distribution
The $\chi^2$ distribution is an asymmetric distribution that has a minimum value of 0, but no maximum
value. For more than two degrees of freedom the curve reaches a peak to the right of 0 and then
gradually declines in height as the $\chi^2$ value grows. The curve approaches, but never quite
touches, the horizontal axis.
For each degree of freedom there is a different $\chi^2$ distribution. The mean of the chi-square
distribution is the degrees of freedom and the variance is twice the degrees of freedom (so the
standard deviation is the square root of twice the degrees of freedom).
This implies that the $\chi^2$ distribution is more spread out, with a peak farther to the right, for
larger than for smaller degrees of freedom. As a result, for any given level of significance, the
critical region begins at a larger chi-square value, the larger the degrees of freedom.
In its graphical representation the $\chi^2$ value is on the horizontal axis, with the probability
density for each $\chi^2$ value being represented by the vertical axis. The three lines in the diagram
represent the pattern of chi-square for 1, 5 and 10 degrees of freedom respectively.
Figure 7: Chi-square distribution with different degrees of freedom
If the random variable x has the standard normal distribution, what would be the distribution of $x^2$?
Intuitively speaking, it would be quite different from a normal distribution because now $x^2$, being
a squared term, can assume only non-negative values. The probability density of $x^2$ will be the
highest near 0, because most of the values are close to 0 in a standard normal distribution. This
distribution is called the chi-square distribution with 1 degree of freedom.
The chi-square distribution has only one parameter, viz. the degrees of freedom, and so there are
many chi-square distributions, each with its own degrees of freedom. In statistical tables, chi-square
values for different areas under the right tail and the left tail of various chi-square distributions
are tabulated.
If $x_1, x_2, \ldots, x_n$ are independent random variables, each having a standard normal distribution,
then $x_1^2 + x_2^2 + \cdots + x_n^2$ will have a chi-square distribution with n degrees of freedom.
If $y_1$ and $y_2$ are independent random variables having chi-square distributions with $\gamma_1$ and
$\gamma_2$ degrees of freedom, then $(y_1 + y_2)$ will have a chi-square distribution with
$\gamma_1 + \gamma_2$ degrees of freedom.
Further, if $y_1$ and $y_2$ are independent random variables such that $y_1$ has a chi-square
distribution with $\gamma_1$ degrees of freedom and $(y_1 + y_2)$ has a chi-square distribution with
$\gamma > \gamma_1$ degrees of freedom, then $y_2$ will have a chi-square distribution with
$(\gamma - \gamma_1)$ degrees of freedom.
Now, if $x_1, x_2, \ldots, x_n$ are n random variables from a normal population with mean $\mu$ and
variance $\sigma^2$,
i.e. $x_i \sim N(\mu, \sigma^2)$, $i = 1, 2, \ldots, n$,
it implies that
$$\frac{x_i - \mu}{\sigma} \sim N(0, 1)$$
and so $\left(\frac{x_i - \mu}{\sigma}\right)^2$ will have a chi-square distribution with 1 degree of freedom.
Hence, $\sum_{i=1}^{n}\left(\frac{x_i - \mu}{\sigma}\right)^2$ will have a chi-square distribution with n degrees of freedom.
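We can break up this expression by measuring the deviation from $\bar{x}$ in place of $\mu$. We will then have
$$\sum_{i=1}^{n}\left(\frac{x_i - \mu}{\sigma}\right)^2 = \frac{1}{\sigma^2}\sum_{i=1}^{n}\left[(x_i - \bar{x}) + (\bar{x} - \mu)\right]^2$$
$$= \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \bar{x})^2 + \frac{1}{\sigma^2}\sum_{i=1}^{n}(\bar{x} - \mu)^2 + \frac{2(\bar{x} - \mu)}{\sigma^2}\sum_{i=1}^{n}(x_i - \bar{x})$$
$$= \frac{(n-1)s^2}{\sigma^2} + \left(\frac{\bar{x} - \mu}{\sigma/\sqrt{n}}\right)^2$$
since $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$.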
Now, it is known that the LHS of the above equation is a random variable which has a chi-square
distribution with n degrees of freedom. It is also known that
$$\bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
$\therefore \left(\frac{\bar{x} - \mu}{\sigma/\sqrt{n}}\right)^2$ will have a chi-square distribution with 1 degree of freedom.
Hence, if the two terms on the right hand side of the above equation are independent (which will be
assumed as true here), then it follows that $\frac{(n-1)s^2}{\sigma^2}$ has a chi-square distribution
with (n - 1) degrees of freedom. One degree of freedom is lost because the deviations are measured
from $\bar{x}$ and not from $\mu$.
Expected Value and Variance of $s^2$
The mean of a chi-square distribution is equal to its degrees of freedom and the variance is equal to
twice the degrees of freedom. This can be used to find the expected value and the variance of $s^2$.
Since $\frac{(n-1)s^2}{\sigma^2}$ has a chi-square distribution with (n - 1) degrees of freedom,
$$\therefore\; E\left[\frac{(n-1)s^2}{\sigma^2}\right] = n - 1 \quad \text{or} \quad \frac{(n-1)}{\sigma^2} \cdot E(s^2) = n - 1$$
$$\therefore\; E(s^2) = \sigma^2$$
Also,
$$\mathrm{Var}\left[\frac{(n-1)s^2}{\sigma^2}\right] = 2(n-1)$$
Using the definition of variance, we get
$$E\left[\frac{(n-1)s^2}{\sigma^2} - E\left(\frac{(n-1)s^2}{\sigma^2}\right)\right]^2 = 2(n-1) \quad \text{or} \quad E\left[\frac{(n-1)s^2}{\sigma^2} - (n-1)\right]^2 = 2(n-1)$$
$$\text{or} \quad \frac{(n-1)^2}{\sigma^4} E(s^2 - \sigma^2)^2 = 2(n-1) \quad \therefore\; E(s^2 - \sigma^2)^2 = \frac{2\sigma^4}{(n-1)}$$
$$\text{i.e.} \quad \mathrm{Var}(s^2) = \frac{2\sigma^4}{n-1}$$
since the expected value of $s^2$ is equal to $\sigma^2$.
It can, therefore, be concluded that if we take a large number of samples, each with a sample size of
n, from a normal population with mean $\mu$ and variance $\sigma^2$, each sample will perhaps have a
different value for its sample variance $s^2$. But the average of a large number of values of $s^2$
will be close to $\sigma^2$. Also, the variance of $s^2$ falls as the sample size increases.
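These conclusions can be checked numerically. The sketch below assumes a normal population with $\sigma = 2$ and two illustrative sample sizes, 5 and 20 (assumptions for the example only); it compares simulated moments of $s^2$ and of $(n-1)s^2/\sigma^2$ with the values derived above.

```python
import random
import statistics

random.seed(9)

def sample_variances(n, sigma, repetitions=20_000, mu=0.0):
    """Sample variance s^2 (divisor n - 1) for many normal samples of size n."""
    return [statistics.variance([random.gauss(mu, sigma) for _ in range(n)])
            for _ in range(repetitions)]

sigma = 2.0
summary = {}
for n in (5, 20):
    s2 = sample_variances(n, sigma)
    scaled = [(n - 1) * v / sigma**2 for v in s2]    # should behave like chi-square(n - 1)
    summary[n] = {
        "mean_scaled": statistics.fmean(scaled),     # theory: n - 1
        "var_scaled": statistics.pvariance(scaled),  # theory: 2(n - 1)
        "mean_s2": statistics.fmean(s2),             # theory: sigma^2
        "var_s2": statistics.pvariance(s2),          # theory: 2 sigma^4 / (n - 1)
    }
    print(n, {k: round(v, 3) for k, v in summary[n].items()})
```

With $\sigma^2 = 4$ the theoretical variance of $s^2$ is 8 for n = 5 and about 1.68 for n = 20, so the simulated values should show the same drop as the sample size increases.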
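It is important to note here that the chi-square results above are based on the assumption that the
population is distributed normally. If the population does not have a normal distribution, then
$(n-1)s^2/\sigma^2$ need not follow a chi-square distribution and the expression for the variance of
$s^2$ no longer applies in general, although the expected value of $s^2$ still equals $\sigma^2$.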
2.5 Sampling Distribution of the Proportion
Let us assume that 0.80 of all students in a school can pass a test of physical fitness. A random
sample of 20 students is chosen: 13 passed and 7 failed. The parameter π is used to designate the
proportion of subjects in the population that pass (0.80 in this case) and the statistic p is used to
designate the proportion who pass in a sample (13/20 = 0.65 in this case). The sample size (N) in this
example is 20. If repeated samples of size N were taken from the population and the proportion
passing (p) were determined for each sample, a distribution of values of p would be formed. If the
sampling went on forever, the distribution would be the sampling distribution of a proportion. The
sampling distribution of a proportion follows the binomial distribution, rescaled by 1/N because p
is the number passing divided by N. Its mean and standard deviation are:
$$\mu_p = \pi \quad \text{and} \quad \sigma_p = \sqrt{\frac{\pi(1-\pi)}{N}}$$
For the present example, N = 20 and π = 0.80, so the mean of the sampling distribution of p ($\mu_p$)
is 0.8 and the standard error of p ($\sigma_p$) is 0.089. The shape of the binomial distribution
depends on both N and π. With large values of N and values of π in the neighborhood of 0.5, the
sampling distribution is very close to a normal distribution.
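The mean and standard error in the fitness-test example can be compared with a simulation. The sketch below is an illustration only; the number of repetitions and the helper function are assumptions made for the example.

```python
import random
import statistics
from math import sqrt

random.seed(10)

pi, N = 0.80, 20                       # values from the fitness-test example
se_formula = sqrt(pi * (1 - pi) / N)   # sqrt(pi (1 - pi) / N)

def sample_proportion():
    """Proportion passing in one random sample of N students."""
    passes = sum(random.random() < pi for _ in range(N))
    return passes / N

proportions = [sample_proportion() for _ in range(20_000)]
print(f"formula standard error : {se_formula:.3f}")
print(f"simulated mean of p    : {statistics.fmean(proportions):.3f}")
print(f"simulated std dev of p : {statistics.pstdev(proportions):.3f}")
```

Because π = 0.80 is some way from 0.5 and N is only 20, a histogram of `proportions` would still look noticeably skewed to the left, in line with the remark that the normal shape needs large N and π near 0.5.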
Assume that for the population of people applying for a job at a bank in a major city, 0.40 are able
to pass a basic literacy test required to get the job. Out of a group of 20 applicants, what is the
probability that 50% or more of them will pass? This problem involves the sampling distribution of p
with π = 0.40 and N = 20. The mean of the sampling distribution is π = 0.40. The standard deviation
is:
$$\sigma_p = \sqrt{\frac{\pi(1-\pi)}{N}} = \sqrt{\frac{0.40(1-0.40)}{20}} = 0.11$$
Using the normal approximation, a proportion of 0.50 is: (0.50-0.40)/0.11 = 0.909 standard
deviations above the mean. From a z table it can be calculated that 0.818 of the area is below a z of
0.909. Therefore the probability that 50% or more will pass the literacy test is only about 1 - 0.818 =
0.182.
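Since N is only 20 here, it is worth comparing the normal approximation with the exact binomial probability. The following sketch uses only the standard library; the continuity correction shown is a common refinement that is not part of the calculation above and is included as an assumption-labelled extra.

```python
from math import comb, sqrt
from statistics import NormalDist

pi, N, k = 0.40, 20, 10   # 50% of 20 applicants means at least 10 passes

# Exact binomial probability of 10 or more passes.
exact = sum(comb(N, i) * pi**i * (1 - pi)**(N - i) for i in range(k, N + 1))

se = sqrt(pi * (1 - pi) / N)
# Normal approximation as in the text (unrounded standard error).
approx = 1 - NormalDist().cdf((k / N - pi) / se)
# Same approximation with a continuity correction (start the tail at 9.5 passes).
approx_cc = 1 - NormalDist().cdf(((k - 0.5) / N - pi) / se)

print(f"exact binomial            : {exact:.4f}")
print(f"normal approximation      : {approx:.4f}")
print(f"with continuity correction: {approx_cc:.4f}")
```

The plain normal approximation gives roughly 0.18, while the exact binomial tail is roughly 0.24; the continuity-corrected version closes most of that gap. For small N the approximate figure should therefore be read as a rough guide rather than an exact probability.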
2.6 The Confidence Level
The sample mean is a researcher's estimate of the population mean. If we are asked to give an interval
as our estimate, then we would add a range on the upper and the lower side of the sample mean
and give that interval as our estimate. The larger the interval, the greater is our confidence that the
interval does contain the true population mean. It is to be noted that the true population mean is a
constant and is not a variable. On the other hand, the interval that we specify is a random interval
whose position depends on the sample mean. For example if the sample mean is 50 and the standard
error of the mean is 5, we may specify our interval estimate as (45,55) i.e. from 45 to 55 which spans
one standard error of the mean on either side of the sample mean. On the other hand, if the interval
estimate is specified as (40,60) i.e. spanning two standard errors of the mean on either side of the
sample mean, we are more confident that the latter interval contains the true population mean as
compared to the former. However, if the confidence level is raised too high, the corresponding
interval may become too wide to be of any practical use.
The confidence level, therefore, may be defined as the probability that the interval estimate will
contain the true value of the population parameter that is being estimated. If we say that a 95%
confidence interval for the population mean is obtained by spanning 1.96 times the standard error
of the mean on either side of the sample mean, we mean that if we were to take a large number of
samples of size n, say 1000, and obtain the interval estimate from each of these 1000 samples, then
about 95% of these interval estimates would contain the true population mean.
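This repeated-sampling interpretation can be acted out in code. The sketch below assumes a normal population with mean 50 and standard deviation 5, a known σ, and samples of size 25; all of these are hypothetical values chosen for the illustration.

```python
import random
import statistics
from math import sqrt

random.seed(12)

mu, sigma, n = 50.0, 5.0, 25                 # hypothetical population and sample size
z = statistics.NormalDist().inv_cdf(0.975)   # about 1.96 for a 95% level
se = sigma / sqrt(n)                         # standard error of the mean

def interval_covers_mu():
    """Draw one sample, build x_bar +/- z * se, and report whether it contains mu."""
    x_bar = statistics.fmean([random.gauss(mu, sigma) for _ in range(n)])
    return x_bar - z * se <= mu <= x_bar + z * se

trials = 1_000
coverage = sum(interval_covers_mu() for _ in range(trials)) / trials
print(f"{coverage:.1%} of the {trials} intervals contain the true mean")
```

The observed share will not be exactly 95% in any single run of 1000 samples, but it should fall close to it; the population mean stays fixed throughout, and it is the intervals that move from sample to sample.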
Confidence Interval for the Population Mean
Let us now discuss how to obtain a confidence interval for the population mean. We shall assume
that the population distribution is normal and that the population variance is known. Later, we shall
relax the second condition.
Suppose it is known that the weight of cement in packed bags is distributed normally with a standard
deviation of 0.2 Kg. A sample of 25 bags is picked up at random and the mean weight of cement in
these 25 bags is only 49.7 Kg. We want to find a 90% confidence interval for the mean weight of
cement in filled bags.
Let x be a random variable representing the weight of cement in a bag picked up at random. We
know that x is distributed normally with a standard deviation of 0.2 Kg.
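The standard error of the mean can be easily calculated as
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{0.2}{\sqrt{25}} = 0.04 \text{ Kg}$$
Since 90% of the area under the standard normal curve lies within 1.645 standard errors of the
mean, the 90% confidence interval is $49.7 \pm 1.645 \times 0.04$, i.e. from about 49.634 Kg to
49.766 Kg.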
We can use the above approach when the population standard deviation is known, or when the
sample size is large (n > 30), in which case the sample standard deviation can be used as an estimate
of the population standard deviation. However, if the sample size is not large, as in the example
above, and the population standard deviation is not known, then one has to use the t distribution
in place of the standard normal distribution to calculate the probabilities. Let us assume that we
are interested in developing a 90% confidence interval in the same situation as described earlier,
with the difference that the population standard deviation is now not known. However, the sample
standard deviation has been calculated and is known to be 0.2 Kg.
Since the sample size n = 25, we know that
$$\frac{\bar{x} - \mu}{s/\sqrt{n}}$$
follows a t-distribution with 24 degrees of freedom.
From t-tables, we can see that the probability that a t statistic with 24 degrees of freedom lies
between -1.711 and 1.711 is 0.90, i.e. the probability that $(\bar{x} - \mu)$ lies between
$-1.711\, s/\sqrt{n}$ and $+1.711\, s/\sqrt{n}$ is 0.90.
In other words, if we use an interval spanning from $(\bar{x} - 1.711\, s/\sqrt{n})$ to
$(\bar{x} + 1.711\, s/\sqrt{n})$, then 90% of the time this interval will contain $\mu$. Hence, for a
90% confidence interval,
$$\text{the lower limit} = \bar{x} - 1.711 \frac{s}{\sqrt{n}} = 49.7 - 1.711 \times \frac{0.2}{\sqrt{25}} = 49.6316$$
$$\text{and the upper limit} = \bar{x} + 1.711 \frac{s}{\sqrt{n}} = 49.7 + 1.711 \times \frac{0.2}{\sqrt{25}} = 49.7684$$
In this case, we can state with 90% confidence level that the mean weight of cement in a filled bag
lies between 49.6316 Kg and 49.7684 Kg.
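The interval arithmetic for the cement example can be written as a small helper. This is a sketch only: the function name is hypothetical, and the critical value 1.711 is taken from the t-table lookup quoted above rather than computed, since the Python standard library has no t distribution. The known-σ case is included for comparison, using the standard normal quantile.

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval(x_bar, sd, n, critical_value):
    """x_bar +/- critical_value * sd / sqrt(n)."""
    margin = critical_value * sd / sqrt(n)
    return x_bar - margin, x_bar + margin

# Population sd unknown: t critical value for 24 degrees of freedom, from the table.
t_lower, t_upper = confidence_interval(49.7, 0.2, 25, critical_value=1.711)

# Population sd known: standard normal critical value for a 90% level (about 1.645).
z_crit = NormalDist().inv_cdf(0.95)
z_lower, z_upper = confidence_interval(49.7, 0.2, 25, critical_value=z_crit)

print(f"90% CI using t: ({t_lower:.4f}, {t_upper:.4f})")
print(f"90% CI using z: ({z_lower:.4f}, {z_upper:.4f})")
```

The t-based interval is slightly wider than the z-based one, reflecting the extra uncertainty from estimating the standard deviation from only 25 observations.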
Using these derivations and relations, we can also calculate the sample size that would be
appropriate for a particular study at a desired confidence level.
***
Bibliography
1. http://www.nku.edu/~statistics/212_Sampling_Distribution_of_P-hat.htm
2. http://en.wikipedia.org/wiki/Sampling_distribution
3. http://en.wikipedia.org/wiki/Sampling_(statistics)
4. http://onlinestatbook.com
5. Course material on ‘Quantitative analysis for Managerial Applications’, MS-8, 1997, IGNOU,
Maidan Garhi, New Delhi.
6. Course material on ‘Research Methodology for Management Decisions’, MS-95, 1997, IGNOU,
Maidan Garhi, New Delhi.
7. http://stattrek.com/sampling/sampling-distribution.aspx
8. http://www.psychstat.missouristate.edu/introbook/sbk19.htm
9. http://www.stat.berkeley.edu/~stark/SticiGui/Text/index.htm
10. http://www.fao.org/docrep/w7295e/w7295e08.htm#6
***

Work samplingWork sampling
Work sampling
 
Sampling methods
Sampling methodsSampling methods
Sampling methods
 
How to design questionnaire
How to design questionnaireHow to design questionnaire
How to design questionnaire
 
Retail Audit of Lux, Lifebuoy & Breeze
Retail Audit of Lux, Lifebuoy & BreezeRetail Audit of Lux, Lifebuoy & Breeze
Retail Audit of Lux, Lifebuoy & Breeze
 
Sampling design 1216114348242957-8
Sampling design 1216114348242957-8Sampling design 1216114348242957-8
Sampling design 1216114348242957-8
 
Marketing Research PPT - III
Marketing Research PPT - IIIMarketing Research PPT - III
Marketing Research PPT - III
 

Recently uploaded

From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 

Recently uploaded (20)

From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 

Sampling and Sampling Distribution

  • 2. 1 Sampling and Sampling Distribution  Umesh K. Pandey M.Sc., MBA, MBL, DIPC, Ph.D. © The Author Important notice on reuse, reproduction or commercial use: Complete reproduction without alteration of the content, partial or as a whole, is permitted for non-commercial, personal and academic purposes without a prior permission provided such reproduction includes full citation of the article, an acknowledgement of the copyright and link to the article. The author should be informed about this use if more than one copy is being made or the content, partial or as a whole, is being reproduced on a website, intranet or any other electronic media. Contents of this ebook, partial or as a whole, should not be included in a framed web page. Contents of this site, partial or as a whole, should not be included in a password protected site or a site which requires registration, even if free. Contents of this ebook, partial or as a whole, should not be included in a site which charges for other contents but provides the content from this site for free. A site or any electronic media reproducing content from this ebook if includes advertisements placed along with our content or generates any form of revenue due to the contents of this ebook, should share the revenue with the author. For seeking permission for commercial reuse please contact author at yukaypee@hotmail.com
  • 3. 2 INDEX
Sampling and Sampling Distribution (Page No)
1. Sampling 3
1.1 What is sampling? 3
1.2 Why Sampling instead of Census? 4
1.3 Sampling methods 7
1.3.1 Probability Sampling Methods 7
1.3.1.1 Simple Random Sampling 8
1.3.1.2 Systematic Sampling 9
1.3.1.3 Stratified Sampling 10
1.3.1.4 Cluster Sampling 12
1.3.2 Non-Probability Sampling Methods 13
1.3.2.1 Convenience Sampling 14
1.3.2.2 Purposive Sampling 14
1.3.2.2.1 Judgment Sampling 15
1.3.2.2.2 Quota Sampling 15
2. Sampling Distribution 16
2.1 Sampling Distribution of the Mean 18
2.2 The Central Limit Theorem 22
2.3 Sampling Distribution of the variance 23
2.4 The Chi-square Distribution 24
2.5 Sampling Distribution of the proportion 26
2.6 The Confidence Level 27
3. Bibliography 29
  • 4. 3 1. Sampling
When managers use research, they are applying the methods of science to the art of management. Business operates in a world of uncertainty, and no single method can eliminate that uncertainty entirely. Nevertheless, research methodology can minimise the extent of uncertainty and reduce the probability of making a wrong choice among alternative courses of action. The increasingly complex nature of business and governance therefore focuses more and more attention on the use of research methodology in solving managerial problems. In the prevailing highly involved environment, neither a business decision nor a governmental decision can be made casually or on intuition. It is through appropriate data and their analysis that the decision maker becomes equipped with proper tools of decision making. Needless to say, the credibility of the results derived from such methodology depends on the reliability of the data included in the analysis.
Inferential statistics is used extensively to address managerial and business problems with the relevant data. Inferential statistics are the quantitative tools that use samples to estimate a population parameter and to judge whether an observed result goes beyond what could happen by chance. Good research is only as good as the design, methods, and statistics used. Yet the design, methods, and statistics are useless if an optimal sample is not used in the first place. Thus sampling is the cornerstone of any business research.
[Figure 1: Research Methodology (labels: Population, Sample, Statistic, Parameter; Sampling, Estimation, Inference)]
1.1 What is Sampling?
The term “sampling” indicates the selection of a part of a group or an aggregate with a view to obtaining information about the whole. This aggregate or the
  • 5. 4 totality of all members is known as the Population, although the members need not be human beings. The selected part, which is used to ascertain the characteristics of the population, is called the Sample. While choosing a sample, the population is assumed to be composed of individual units or members, some of which are included in the sample. The total number of members of the population and the number included in the sample are called the Population Size and the Sample Size respectively. The concept can be shown through a Venn diagram in which the population is the universal set and the sample is a proper subset.
[Figure 2: Population & Sample (Population: set of all items; Sample: set of chosen items)]
The process of generalising on the basis of information collected on a part is a long-established practice. With the advancement of management science, more sophisticated applications of sampling in business and industry are available. Sampling methodology can be used by an auditor or an accountant to estimate the value of total inventory in the stores without physically inspecting all the items. Opinion polls based on samples are used to forecast the result of a forthcoming election.
1.2 Why Sampling instead of Census?
A census, or complete enumeration, consists of collecting data from each and every unit of the population. Sampling chooses only a part of the units from the population for the same study. Sampling has a number of advantages over complete enumeration.
Cost
The first obvious advantage of sampling is that it is less expensive. If we want to study the consumer reaction before launching a new product it will be much less expensive to
  • 6. 5 carry out a consumer survey based on a sample rather than studying the entire population, which is the potential group of customers. Although every individual is enumerated in the decennial census, certain aspects of the population are studied on a sample basis with a view to reducing cost.
Time
The smaller size of the sample enables us to collect the data more quickly than a survey of all the units of the population, even if we are willing to spend the money. This matters particularly when the decision is time bound. An accountant may need to know the total inventory value quickly to prepare a periodical report such as a quarterly balance sheet and a profit and loss account. A detailed study of the inventory is likely to take too long for the report to be prepared in time. If we want to measure the Consumer Price Index in a particular month, we cannot collect data on all consumer prices even if expenditure is not a hindrance. The collection and processing of data on all consumer items would in all probability take a very long time, so the price index, when ready, would no longer serve any meaningful purpose.
Accuracy
It is possible to achieve greater accuracy by using appropriate sampling techniques than by a complete enumeration of all the units of the population. Contrary to common belief, complete enumeration may result in inaccurate data owing to the fatigue of the enumerator or to spurious and unreliable data collected because of the large volume. If, on the other hand, a small number of items is observed, the basic data will be much more accurate. It is of course true that drawing a conclusion about a population characteristic, such as the proportion of defective items, from a sample also introduces error. However, such errors, known as sampling errors, can be studied and controlled, and probability statements can be made about their magnitude.
The inaccuracy that results from causes such as the fatigue of the inspector is known as non-sampling error. It is difficult to recognise the pattern of non-sampling error, and it is not possible to comment on its magnitude even probabilistically.
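The statement that sampling error can be studied and described probabilistically can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the population of 10,000 values, the sample sizes 25 and 400, and the helper name `spread_of_sample_means` are all hypothetical and do not come from the text. It draws repeated simple random samples and measures how much the sample means vary from sample to sample.

```python
import random
import statistics


def spread_of_sample_means(population, n, repetitions, rng):
    """Standard deviation of the means of repeated simple random samples of size n."""
    means = [statistics.mean(rng.sample(population, n)) for _ in range(repetitions)]
    return statistics.stdev(means)


rng = random.Random(0)  # fixed seed only so that the sketch is repeatable

# Hypothetical population of 10,000 item values (an assumption, not data from the text)
population = [rng.gauss(50, 10) for _ in range(10_000)]

spread_small = spread_of_sample_means(population, 25, 200, rng)
spread_large = spread_of_sample_means(population, 400, 200, rng)

print(f"spread of sample means, n = 25:  {spread_small:.2f}")
print(f"spread of sample means, n = 400: {spread_large:.2f}")
```

The spread of the sample means shrinks as the sample size grows, which is the sense in which sampling error can be controlled. Non-sampling error has no comparable handle.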
  • 7. 6 Reliability of Inference
In many cases, sampling provides adequate information, so that not much additional reliability can be gained from complete enumeration in spite of spending large amounts of additional money and time. With some types of sampling it is also possible to quantify the magnitude of the possible error, which is not the case in the census approach.
Impossibility of complete enumeration
In many situations the item being studied is destroyed while being tested. Sampling is indispensable under such circumstances. If one is interested in computing the average life of Compact Fluorescent Lamps (CFLs) supplied in a batch, the life of the entire batch cannot be examined, since this would mean that the entire supply is wasted. In such cases there is no alternative but to examine the life of a sample of CFLs and draw an inference about the entire batch.
Infeasibility of complete enumeration
More often than not, complete enumeration is infeasible because of practical difficulties. For example, suppose a shaving gel manufacturer wants to launch a new and improved version of its gel. To get consumer feedback, the manufacturer distributes the old version of the gel to, say, 500 consumers and after a week or so replaces it with the new version to get feedback on various attributes of the product. In this situation, it would be infeasible to collect information from all the consumers of shaving gel in India. Some consumers would have moved from one place to another during the period of study, some would have stopped using shaving gel just before the period of study, and others would have been users during the period of study but would have stopped using it some time later. In such situations, although it is theoretically possible to do a complete enumeration, it is practically infeasible to do so.
The above account shows that, in many situations, a research study based on sampling gives reliable results with greater convenience than a study of the entire population.
  • 8. 7 1.3 Sampling methods
Good research is only as good as the design, methods, and statistics used. Yet the design, methods, and statistics are useless if we do not use an optimal sample in the first place. A sampling frame is a list of all the units of the population. The sampling frame should always be up to date and free from errors of omission and duplication of sampling units. A perfect frame identifies each element once and only once. Perfect frames are seldom available in real life. Nevertheless, it needs to be ensured that the sampling frame is complete, accurate, adequate and up to date. Depending on the requirements of the research, sampling methods are broadly categorised into two groups, namely probability sampling methods and non-probability sampling methods, as depicted in Figure 3.
1.3.1 Probability Sampling Methods
In probability sampling methods the population from which the sample is drawn should be known to the researcher. Under this sampling design every item of the population has a known, non-zero chance of inclusion in the sample; in the simplest designs this chance is the same for every item. The lottery method, for example drawing a student's name blindfolded from a box containing the names of all the students, is the classic example of random sampling. It is an unbiased technique and a sound way of selecting a representative sample. Its major disadvantage is that it needs the complete sampling frame, i.e. the list of all the items of the population, which is not always available. The probability sampling methods are of four types, namely Simple Random Sampling, Systematic Sampling, Stratified Sampling and Cluster Sampling.
[Figure 3: Sampling Methods (Probability: Simple Random, Systematic, Stratified, Cluster; Non-probability: Convenience, Purposive, with Purposive divided into Judgment and Quota)]
  • 9. 8 1.3.1.1 Simple Random Sampling
Simple random sampling is based on the concept of probability. The use of probability in sampling theory makes it a reliable tool for drawing inferences or conclusions about the population. Although the types of conclusion or inference can be quite diverse, two particular types of decision making are quite prevalent in problems of business and government. On various occasions, the management would like to know the percentage or proportion of units in the population with a certain characteristic. An organisation selling consumer products may like to know the proportion of potential consumers using a certain type of cosmetic. The government may like to know the percentage of small farmers owning some cultivable land in a rural region. A manufacturer planning to export a product may be interested in ascertaining the proportion of defect-free units his system is capable of manufacturing. The representative character of a sample is ensured by assigning some probability to each unit of the population of being included in the sample. Simple random sampling assigns equal probability to each unit of the population. A simple random sample can be chosen either with or without replacement.
Simple Random Sampling with Replacement
Suppose the population consists of N units and we want to select a sample of size n. In simple random sampling with replacement we choose an observation from the population in such a manner that every unit of the population has an equal chance of 1/N of being selected. After the first unit is selected, its value is recorded and it is placed back in the population. The second unit is drawn in exactly the same manner as the first unit. This procedure is continued until the nth unit of the sample is selected. In this case each unit of the population has an equal chance of 1/N of being selected at each of the n draws.
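The two ways of drawing a simple random sample can be sketched with Python's standard `random` module. This is a minimal illustration, not the author's procedure: the frame of units numbered 1 to 100, the sample size of 10 and the function names are assumptions made for the example. With replacement, the same unit may appear more than once; without replacement, all selected units are distinct.

```python
import random


def srs_with_replacement(population, n, rng):
    """Draw n units; every draw picks any of the N units with chance 1/N."""
    return [rng.choice(population) for _ in range(n)]


def srs_without_replacement(population, n, rng):
    """Draw n distinct units; a selected unit is not returned to the population."""
    return rng.sample(population, n)


rng = random.Random(42)  # fixed seed only so that the sketch is repeatable

# Hypothetical sampling frame: units numbered 1..100 (an assumption for illustration)
population = list(range(1, 101))

with_repl = srs_with_replacement(population, 10, rng)
without_repl = srs_without_replacement(population, 10, rng)

print("with replacement:   ", with_repl)
print("without replacement:", without_repl)
```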
  • 10. 9 Simple Random Sampling without Replacement
In this case, when the first unit is chosen, every unit of the population has a chance of $\frac{1}{N}$ of being selected. After the first unit is chosen it is not returned to the population. The second unit is selected from the remaining $N-1$ members of the population, so that each remaining unit has a chance of $\frac{1}{N-1}$ of being selected. The procedure is continued until the nth unit of the sample is chosen with probability $\frac{1}{N-n+1}$. Random numbers for simple random sampling are generated using a probabilistic mechanism.
1.3.1.2 Systematic Sampling
Systematic sampling involves selecting items at a constant interval determined by the sampling ratio, with a random start within the first interval. For example, if a sample of size 10 is required from a population of size 100, the sampling ratio is n/N = 10/100 = 1/10. It therefore has to be decided where to start among the first 10 names in the sampling frame. If this number happens to be 5, for example, then the sample contains the members with serial numbers 5, 15, 25, 35, ..., 95 in the frame. It is noteworthy that the random process establishes only the first member of the sample; the rest are pre-determined by the known sampling ratio. Usually the starting serial number is decided by chance, using a table of random numbers. In other words, sampling starts by selecting an element at random from the first k elements of the list, and then every kth element in the frame is selected, where k is the sampling interval (sometimes known as the skip), calculated as $k = \frac{N}{n}$, where n is the sample size and N is the population size. Systematic sampling is much easier to implement than simple random sampling.
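The worked example above (N = 100, n = 10, start at 5) can be reproduced with a short sketch. The function name `systematic_sample` and its optional `start` argument are hypothetical, introduced only for illustration, and the sketch assumes that N is an exact multiple of n so that the interval k = N/n is a whole number.

```python
import random


def systematic_sample(frame, n, rng, start=None):
    """Select every k-th unit of the frame after a random start in the first interval.

    Assumes len(frame) is an exact multiple of n, so that k = N / n is an integer.
    """
    N = len(frame)
    k = N // n  # sampling interval (the "skip")
    if start is None:
        start = rng.randint(1, k)  # random 1-based serial number among the first k
    return [frame[serial - 1] for serial in range(start, N + 1, k)]


rng = random.Random(7)  # fixed seed only so that the sketch is repeatable

# Hypothetical frame of serial numbers 1..100
frame = list(range(1, 101))

fixed_start = systematic_sample(frame, 10, rng, start=5)
random_start = systematic_sample(frame, 10, rng)

print(fixed_start)
print(random_start)
```

Only the starting serial number is random. Once it is fixed, the remaining nine members follow from the interval.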
[Figure 4: Systematic Sampling]
However, there is one possibility that should be guarded against while using systematic sampling: the possibility of a strong bias in the results if there is any periodicity in the frame that parallels the sampling ratio. For example if someone were making studies
  • 11. 10 on the demand for various banking transactions in a bank branch by studying the demand on days selected by systematic sampling, and the chosen sampling ratio were 1/7 or 1/14 etc., he would always be studying the demand on the same day of the week, and the inferences could be biased depending on whether the day selected is a Monday or a Friday and so on. If the frame is arranged in ascending or descending order of some attribute, then the location of the first sample element may affect the result of the study. For example, suppose the frame contains a list of students arranged in descending order of their percentage in the previous examination and we are picking a systematic sample with a sampling ratio of 1/50. If the first number picked is 1 or 2, the sample chosen will be academically much better than another systematic sample whose first number is 49 or 50. In such situations, one should devise ways of nullifying the bias due to the starting number, for example by insisting on multiple starts after a small cycle. On the other hand, if the frame is arranged so that similar elements are grouped together, then systematic sampling produces almost a proportional stratified sample and is therefore statistically more efficient than simple random sampling. Systematic sampling is perhaps the most commonly used of the probability sampling designs, and for many purposes, e.g. for estimating the precision of the results, systematic samples are treated as simple random samples.
1.3.1.3 Stratified Sampling
[Figure 5: Stratified Sampling (strata 1, 2, ..., p of sizes N1, N2, ..., Np, with samples of sizes n1, n2, ..., np)]
Simple random sampling may not always provide a representative snapshot of the population. Certain segments of a population can easily be under-represented when an unrestricted random sample is chosen. Hence, when considerable heterogeneity is present in the population with regard to subject matter under
  • 12. 11 study, it is often a good idea to divide the population into segments or strata and select a certain number of sampling units from each stratum, thus ensuring representation from all relevant segments. Thus, for designing a suitable marketing strategy for a consumer durable, the population of consumers may be divided into strata by income level and a certain number of consumers can be selected randomly from each stratum. In stratified random sampling, therefore, the population is first divided into different homogeneous groups or strata, which may be based upon a single criterion such as sex, or upon a combination of criteria such as sex, caste, level of education and so on. This method is generally applied when different categories of individuals constitute the population, e.g. General, OBC, SC, ST; or upper income, middle income, lower income; or big farmers, small farmers, marginal farmers, landless farmers, etc. To get an accurate picture of the standard of living of a particular population in India, it is advisable to categorise the population on the basis of caste, religion or land holding; otherwise some sections may be under-represented or not represented at all. Stratified random sampling may be either Proportionate Stratified Random Sampling or Disproportionate Stratified Random Sampling.
Proportionate Stratified Random Sampling
In proportionate stratified random sampling, the researcher stratifies the population according to known characteristics and then randomly draws from each stratum a sample in proportion to that stratum's share of the population. That is, the population is divided into several sub-populations depending upon some known characteristics; these sub-populations are called strata, and each of them is homogeneous.
For example, a town area committee has 15000 voters, of whom 60% are Hindus, 30% are Muslims and 10% are others, and the researcher wants to draw a sample of 300 voters from the population in proportion to these shares. This can be done by multiplying the sample size by each proportion: the sample of Hindu voters will be 300 x 60% = 180, of Muslim voters 300 x 30% = 90 and of others 300 x 10% = 30. The researcher therefore has to collect the complete voter list of the town and randomly select the sample from each category as calculated above. In this method the sampling error is reduced and the sample reflects the composition of the population with respect to the stratifying characteristic.
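The allocation arithmetic in this example can be written as a small function. This is a sketch under stated assumptions: the function name `proportionate_allocation` is hypothetical, and the stratum sizes 9000, 4500 and 1500 are simply 60%, 30% and 10% of the 15000 voters given in the text.

```python
def proportionate_allocation(stratum_sizes, sample_size):
    """Allocate the sample to strata in proportion to their share of the population.

    Rounding is used for illustration; in general the rounded allocations
    may not add up exactly to sample_size and may need a small adjustment.
    """
    total = sum(stratum_sizes.values())
    return {
        name: round(sample_size * size / total)
        for name, size in stratum_sizes.items()
    }


# Stratum sizes derived from the text: 60%, 30% and 10% of 15000 voters
strata = {"Hindu": 9000, "Muslim": 4500, "Others": 1500}

allocation = proportionate_allocation(strata, 300)
print(allocation)
```

Once the allocation is known, a simple random sample of the allocated size is drawn separately from the voter list of each stratum.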
  • 13. 12 Disproportionate Stratified Random Sampling
In this method the number of sampling units drawn from each stratum is not necessarily in proportion to the stratum's share of the population. Suppose that, for the same town, the researcher wants to know the voting pattern of male and female Hindu, Muslim and other voters; in that case he must take an equal number of male and female voters from each category. Here the investigator gives equal weightage to each stratum. Some strata are then over-represented and others under-represented, so the sample is not truly representative of the population; the method is nevertheless used in some special cases. If the different strata in the population have unequal variances of the characteristic being measured, then the sample size allocation should consider the variance as well. It is logical to take a smaller sample from a stratum where the variance is smaller than from a stratum where the variance is higher. In fact, if $\sigma_1^2, \sigma_2^2, \ldots, \sigma_p^2$ are the variances of the p strata, then the statistical efficiency is the highest when
$\frac{n_1}{N_1 \sigma_1} = \frac{n_2}{N_2 \sigma_2} = \cdots = \frac{n_p}{N_p \sigma_p}$
1.3.1.4 Cluster sampling
This is another type of probability sampling method, in which the sampling units are not individual elements of the population; instead, groups of elements or individuals are selected as the sample. In cluster sampling the total population is divided into a number of relatively small sub-divisions or groups, called clusters, and some of these clusters are then randomly selected for inclusion in the sample. Suppose a researcher wants to study the functioning of the mid-day meal service in a district; he can study the schools clustered in one or two blocks instead of selecting schools scattered all over the district. Cluster sampling reduces the cost and labour of collecting the data but is less precise than simple random sampling. We can now compare Cluster Sampling with Stratified Sampling.
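Before turning to that comparison, the mid-day meal example can be sketched as a one-stage cluster sample. Everything specific here is an assumption for illustration: the block names, the school names and the function name `cluster_sample` are hypothetical. The point of the sketch is that randomness enters only in the choice of clusters, and every unit inside a chosen cluster is then studied.

```python
import random


def cluster_sample(clusters, n_clusters, rng):
    """Randomly select whole clusters and keep every unit inside each selected cluster."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}


rng = random.Random(3)  # fixed seed only so that the sketch is repeatable

# Hypothetical district: blocks (clusters) and the schools located in each block
district = {
    "Block A": ["School A1", "School A2", "School A3"],
    "Block B": ["School B1", "School B2"],
    "Block C": ["School C1", "School C2", "School C3", "School C4"],
    "Block D": ["School D1", "School D2"],
}

selected = cluster_sample(district, 2, rng)
for block, schools in selected.items():
    print(block, schools)
```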
Stratification is done to make each stratum homogeneous within itself and different from the other strata. Clusters, on the other hand, should be heterogeneous within and the different clusters should be similar to each other. A cluster, ideally, is a mini-population and has all the features of the population. The criterion used for stratification is a variable which is closely associated with the characteristic we are measuring, e.g. income level when we are measuring the family consumption of non-aerated
beverages. On the other hand, convenience of data collection is usually the basis for defining clusters. Geographic contiguity is quite often used to define clusters and in such cases cluster sampling is also known as Area Sampling. In stratified sampling there are relatively few strata, and a random sample must be picked from each stratum in order to draw inferences. In cluster sampling there are many clusters, of which only a few are picked by random sampling, and the selected clusters are then completely enumerated.
Multi-stage and Multi-phase Sampling
In this method the sample is drawn in more than one step. Multi-stage sampling is used in most large surveys, where the sampling unit is something larger than an individual element of the population in every stage but the final one. For example, in a national survey on the demand for fertilizers one might use stratified sampling in the first stage, with a district as the sampling unit and the average rainfall in the district as the criterion for stratification. Having obtained 20 districts from this stage, cluster sampling may be used in the second stage to pick 10 villages in each of the selected districts. Finally, in the third stage, stratified sampling may be used in each village to pick farms from each of the strata defined with land holding as the criterion.
Multi-phase sampling, on the other hand, is designed to make use of the information collected in one phase to develop the sampling design for a subsequent phase. A study with two phases is often called Double Sampling. The first phase of the study might reveal a relationship between the family consumption of non-aerated beverages and the family income, and this information would then be used in the second phase to stratify the population with family income as the criterion.
1.3.2 Non-Probability Sampling Methods
Probability sampling has some theoretical advantages over non-probability sampling.
The bias introduced by sampling can be completely eliminated and it is possible to set a confidence interval for the population parameter that is being studied. In spite of these advantages of probability sampling, non-probability sampling is used quite frequently in sample surveys, largely for practical reasons. Probability sampling requires a list of all the sampling units (a frame); this frame is not available in many situations, nor is it practically feasible to develop a frame of, say, all the households in a city or in a zone or ward of a city. Sometimes the objective of the study may not be to draw a statistical inference about the population but to get familiar with extreme cases, or some other such objective. In a dealer survey, our objective may be to get familiar with the problems faced by our dealers so that we can
take some corrective actions, wherever possible. Probability sampling is rigorous and this rigour, e.g. in selecting samples, adds to the cost of the study. And finally, even when we are doing probability sampling, there are chances of deviations from the laid-out process, especially where some sample members are selected by the interviewers on site, say after reaching a village. Also, some of the sample members may not agree to be interviewed or may not be available, so that our sample turns out to be a non-probability sample in the strictest sense of the term.
1.3.2.1 Convenience Sampling
In this type of non-probability sampling, the choice of the sample is left completely to the convenience of the researcher. The cost involved in picking the sample is minimal and the cost of data collection is also generally low; e.g. the researcher can go to some retail shops and interview some shoppers while studying the demand for a commodity.
Another form of convenience sampling is known as 'Snow Ball Sampling'. This is a sociometric sampling technique generally used to study small groups. The persons in a group identify their friends, who in turn identify their own friends and colleagues, until the informal relationships converge into some definite social pattern. It is like a snowball increasing in size as it rolls down an ice field. For example, in research on drug addiction it is difficult to find out who the drug users are, but once one person is identified he can name his partners, and each partner can in turn name another two or three users he knows. In this way the required number of persons is identified and the data are collected. This method is suitable for studies of diffusion of innovation, network analysis and decision making. However, such samples can suffer from excessive bias from known or unknown sources, and there is no way to quantify the possible errors.
1.3.2.2 Purposive Sampling
In convenience sampling, any member of the population can be included in the sample without any restriction. When some restrictions are put on the possible inclusion of a member in the sample, the sampling is called purposive. This is a non-random sampling method in which the researcher deliberately selects the units he considers important for the research and believes to be typical and representative of the population. Say a researcher wants to forecast a political party's chance of coming to power in a general election. He may select some reporters, some teachers and some elite people of the territory and collect their opinions for the purpose of his study. He considers these to be leading persons whose views are relevant to the chance of coming to power
of the party. Because the selection is purposive, the method can have large sampling errors and may lead to misleading conclusions. Purposive sampling is broadly of two types, viz. Judgment Sampling and Quota Sampling.
1.3.2.2.1 Judgment Sampling
In judgment sampling, the judgment or opinion of some experts forms the basis for sample selection. The experts are persons who are believed to have information on the population which can help in giving us better samples. Such sampling is very useful when we want to study rare events, when members have extreme positions, or when the objective of the study is to collect a wide cross-section of views from one extreme to the other.
1.3.2.2.2 Quota Sampling
Even while using non-probability sampling, we might want our sample to be representative of the population in some defined ways. Quota sampling seeks to achieve this so that the bias introduced by sampling is reduced. If, in a given population, 25% of the members belong to the high income group, 25% to the middle income group, 35% to the low income group and 15% are Below Poverty Line (BPL), then under quota sampling we would specify that the sample should contain members in the same proportions as the population, e.g. 15% of the sample members would belong to the BPL group, and so on. The criteria used to set quotas could be many. For example, family size could be another criterion and we can set quotas for families of up to 3 members, of 4 or 5 members, and of more than 5 members. However, if the number of such criteria is large, it becomes difficult to locate sample members satisfying each combination of the criteria. In such cases, the overall relative frequency of each criterion in the sample is matched with the overall relative frequency of that criterion in the population.
This method of sampling is almost the same as the stratified random sampling described above; the only difference is that the elements are not selected at random but are chosen to fill the quotas. As quota sampling is not random, the method is biased and can lead to large sampling errors.
2. The Sampling Distribution
Sample statistics form the basis of all inferences drawn about populations. If we know the probability distribution of the sample statistic, then we can calculate the probability that the sample statistic assumes a particular value (if it is a discrete random variable) or has a value in a given interval. This ability to calculate the probability that the sample statistic lies in a particular interval is the most important factor in all statistical inferences. Let us demonstrate this by an example.
Suppose we know that 55% of the population of all users of shampoo prefer brand 'A' to the next competing brand. A "new improved" version of 'A' has been developed and given to a random sample of 200 shampoo users for use. If 120 of these prefer the "new improved" version to the next competing brand, what should one conclude? For an answer, we would like to know the probability that the sample proportion in a sample of size 200 is 60% or higher when the true population proportion is only 55%, i.e. assuming that the new version is no better than the old. If this probability is quite large, say 0.5, we might conclude that the high sample proportion, viz. 60%, is perhaps due to sampling error and the new version is not really superior to the old. On the other hand, if this probability works out to a very small figure, say 0.001, then we might conclude that the true population proportion is higher than 55%, i.e. the new version is actually superior to the old one as perceived by members of the population. To calculate this probability, we need to know the probability distribution of the sample proportion, i.e. the sampling distribution of the proportion.
The sampling distribution, thus, is the distribution of a sample statistic. It is a model of a distribution of scores, like the population distribution, except that the scores are not raw scores, but statistics.
It is a thought experiment; "what would the world be like if a person repeatedly took samples of size N from the population distribution and computed a particular statistic each time?" The resulting distribution of statistics is called the sampling distribution of that statistic. For example, suppose that a sample of size sixteen (N=16) is taken from some population. The mean of the sixteen numbers is computed. Next a new sample of sixteen is taken, and the mean is again computed. If this process were repeated an infinite number of times, the distribution of the now infinite number of sample means would be called the sampling distribution of the mean. Similarly, every statistic has a sampling distribution. Just as the population models can be described with parameters, so can the sampling distribution. The expected value (analogous to the mean) of a sampling distribution will be represented here by
the symbol µ. The µ symbol is often written with a subscript to indicate which sampling distribution is being discussed. For example, the expected value of the sampling distribution of the mean is represented by the symbol $\mu_{\bar{x}}$, that of the median by $\mu_{Md}$, etc. The value of $\mu_{\bar{x}}$ can be thought of as the mean of the distribution of means. In a similar manner the value of $\mu_{Md}$ is the mean of a distribution of medians. They are not really means, because it is not possible to find a mean when $N = \infty$, but are the mathematical equivalent of a mean.
Using advanced mathematics, in a thought experiment, the theoretical statistician often discovers a relationship between the expected value of a statistic and the model parameters. For example, it can be proven that, for a symmetric population such as the normal, the expected values of both the mean and the median, $\bar{X}$ and Md, are equal to the population mean $\mu_x$. When the expected value of a statistic equals a population parameter, the statistic is called an unbiased estimator of that parameter. In this case, both the mean and the median are unbiased estimators of the parameter $\mu_x$.
A sampling distribution may also be described with a parameter corresponding to a variance, symbolized by $\sigma^2$. The square root of this parameter is given a special name, the standard error. Each sampling distribution has a standard error. In order to keep them straight, each has a name tagged on the end of the words "standard error" and a subscript on the σ symbol. The standard deviation of the sampling distribution of the mean is called the standard error of the mean and is symbolized by $\sigma_{\bar{x}}$. Similarly, the standard deviation of the sampling distribution of the median is called the standard error of the median and is symbolized by $\sigma_{Md}$. In each case the standard error of a statistic describes the degree to which the computed statistics will differ from one another when calculated from samples of the same size selected from the same population model.
The larger the standard error, the greater the difference between the computed statistics. A small standard error (efficiency) is a valuable property in the estimation of a population parameter: everything else being equal, the statistic with the smallest standard error is preferred as the estimator of the corresponding population parameter. Statisticians have proven that in most cases the standard error of the mean is smaller than the standard error of the median. Because of this property, the mean is the preferred estimator of $\mu_x$.
In practice, we refer to the sampling distributions of only the commonly used sample statistics, like the sample mean, sample variance, sample proportion, sample median etc., which have a role in making inferences about the population.
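The thought experiment of repeated sampling can be approximated by simulation. The sketch below is only an illustration under stated assumptions: the population is taken to be standard normal, the sample size is 16 as in the earlier example, and 5,000 repetitions stand in for "an infinite number of times". The helper name `simulate` is hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def simulate(statistic, n=16, repetitions=5000):
    """Approximate the standard error of `statistic` for samples of size n.

    Draws `repetitions` samples from an assumed N(0, 1) population, computes
    the statistic on each, and returns the standard deviation of the results.
    """
    values = [
        statistic([random.gauss(0, 1) for _ in range(n)])
        for _ in range(repetitions)
    ]
    return statistics.stdev(values)


se_mean = simulate(statistics.mean)
se_median = simulate(statistics.median)
print("standard error of the mean:  ", round(se_mean, 3))
print("standard error of the median:", round(se_median, 3))
```

For a normal population the simulated standard error of the mean should come out near 1/√16 = 0.25 and below that of the median, which is the property that makes the mean the preferred estimator here.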
2.1 The Sampling Distribution of the Mean
There are many (infinite!) possible values of the sample mean, and the particular value that we obtain, if we pick only one sample, is determined only by chance. The distribution of the sample mean is also referred to as the sampling distribution of the mean. To observe the distribution of $\bar{x}$ empirically, we have to take many samples of size n and determine the value of $\bar{x}$ for each sample. Then, looking at the various observed values of $\bar{x}$, it might be possible to get an idea of the nature of the distribution. Such a sampling distribution of the mean is also known as the distribution of sample means. This distribution is described by the parameters $\mu_{\bar{x}}$ and $\sigma_{\bar{x}}$.
Sampling from Infinite Populations
Let us study two cases:
1. Where the population is infinitely large or the sampling is done with replacement
2. Where the population is finite and we are sampling without replacement
For the first scenario, let us assume we have a population which is infinitely large, with a population mean of $\mu$ and a population variance of $\sigma^2$. This implies that if x is a random variable denoting the measurement of the characteristic that we are interested in, on one element of the population picked at random, then the expected value of x is $E(x) = \mu$ and the variance of x is $Var(x) = \sigma^2$.
The sample mean, $\bar{x}$, can be looked at as the sum of n random variables, viz. $x_1, x_2, \ldots, x_n$, each multiplied by (1/n). Here $x_1$ is a random variable representing the first observed value in the sample, $x_2$ the second observed value, and so on. Now, when the population is infinitely large, whatever the value of $x_1$, the distribution of $x_2$ is not affected by it. This is true of any other pair of random variables as well. In other words, $x_1, x_2, \ldots, x_n$ are independent random variables and all are picked from the same population.
$$\therefore\; E(x_1) = \mu \text{ and } Var(x_1) = \sigma^2, \quad E(x_2) = \mu \text{ and } Var(x_2) = \sigma^2, \quad \ldots \text{ and so on}$$
Finally,
$$E(\bar{x}) = E\left[\frac{x_1 + x_2 + \cdots + x_n}{n}\right]$$
$$= \frac{1}{n}E(x_1) + \frac{1}{n}E(x_2) + \cdots + \frac{1}{n}E(x_n)$$
$$= \frac{1}{n}\mu + \frac{1}{n}\mu + \cdots + \frac{1}{n}\mu = \mu$$
This means that the expected value of the sample mean is the same as the population mean. Also, because the $x_i$ are independent,
$$Var(\bar{x}) = Var\left[\frac{x_1 + x_2 + \cdots + x_n}{n}\right]$$
$$= Var\left(\frac{x_1}{n}\right) + Var\left(\frac{x_2}{n}\right) + \cdots + Var\left(\frac{x_n}{n}\right)$$
$$= \frac{1}{n^2}Var(x_1) + \frac{1}{n^2}Var(x_2) + \cdots + \frac{1}{n^2}Var(x_n)$$
$$= \frac{1}{n^2}\sigma^2 + \frac{1}{n^2}\sigma^2 + \cdots + \frac{1}{n^2}\sigma^2 = \frac{\sigma^2}{n}$$
This says that the variance of the sample mean is the variance of the population divided by the sample size.
If we take a large number of samples of size n, then the average value of the sample means tends to be close to the true population mean. On the other hand, if the sample size is increased then the variance of $\bar{x}$ is reduced, and by selecting an appropriately large value of n the variance of $\bar{x}$ can be made as small as desired.
The standard deviation of $\bar{x}$ is also called the standard error of the mean. Very often we estimate the population mean by the sample mean. The standard error of the mean indicates the extent to which the observed value of the sample mean can be away from the true value, due to sampling errors. For example, if the standard error of the mean is small, we are reasonably confident that whatever sample mean value we have observed cannot be very far from the true value. The standard error of the mean is represented by $\sigma_{\bar{x}}$.
Sampling with replacement
The above results have been obtained under the assumption that the random variables $x_1, x_2, \ldots, x_n$ are independent. This assumption is valid when the population is infinitely large. It is also valid when
the sampling is done with replacement, so that the population is back to the same form before the next sample member is picked. Hence, if the sampling is done with replacement, we would again have
$$E(\bar{x}) = \mu \quad \text{and} \quad Var(\bar{x}) = \frac{\sigma^2}{n}, \quad \text{meaning thereby that} \quad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
Sampling Without Replacement from Finite Populations
When a sample is picked without replacement from a finite population, the probability distribution of the second random variable depends on the outcome of the first pick, and so on. As the n random variables representing the n sample members do not remain independent, the expression for the variance of $\bar{x}$ changes. The derivation for this situation gives
$$E(\bar{x}) = \mu \quad \text{and} \quad Var(\bar{x}) = \sigma_{\bar{x}}^2 = \frac{\sigma^2}{n} \cdot \frac{N-n}{N-1}, \quad \text{meaning thereby that} \quad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \cdot \sqrt{\frac{N-n}{N-1}}$$
By comparing these expressions with the ones derived above we find that the standard error of $\bar{x}$ is the same but further multiplied by the factor $\sqrt{(N-n)/(N-1)}$. This factor is, therefore, known as the finite population multiplier.
In practice, almost all samples are picked without replacement. Also, most populations are finite, although they may be very large, and so the standard error of the mean should theoretically be found by using the expression given above. However, if the population size (N) is large and consequently the sampling ratio (n/N) is small, then the finite population multiplier is close to 1 and is not used, thus treating large finite populations as if they were infinitely large. For example, if N = 5,00,000 (i.e. 500,000) and n = 500, the finite population multiplier is
$$\sqrt{\frac{N-n}{N-1}} = \sqrt{\frac{500000 - 500}{500000 - 1}} = \sqrt{\frac{499500}{499999}} = \sqrt{0.999002} = 0.9995$$
which is very close to 1, and the standard error of the mean would, for all practical purposes, be the same whether the population is treated as finite or infinite.
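The finite population multiplier is easy to evaluate numerically. The following sketch reproduces the N = 500,000, n = 500 example; the function names and the population standard deviation of 10 used for the standard error comparison are hypothetical values chosen only for illustration.

```python
import math


def finite_population_multiplier(N, n):
    """sqrt((N - n) / (N - 1)) for population size N and sample size n."""
    return math.sqrt((N - n) / (N - 1))


def standard_error(sigma, n, N=None):
    """Standard error of the mean; pass N to apply the finite population multiplier."""
    se = sigma / math.sqrt(n)
    if N is not None:
        se *= finite_population_multiplier(N, n)
    return se


fpm = finite_population_multiplier(500_000, 500)
print(round(fpm, 4))

sigma = 10.0  # hypothetical population standard deviation
se_infinite = standard_error(sigma, 500)
se_finite = standard_error(sigma, 500, N=500_000)
print(se_infinite, se_finite)
```

With a sampling ratio of 0.001 the two standard errors differ only in the fourth significant digit, which is why the multiplier is usually dropped for large populations.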
As a rule of thumb, the finite population multiplier may be omitted if the sampling ratio (n/N) is smaller than 0.05.
Sampling from Normal Populations
It has been observed that the normal distribution occurs very frequently among natural phenomena. For example, the heights or weights of individuals, the weights of bags filled by an automatic machine, the hardness obtained by heat treatment, etc. are distributed normally.
It is also a known fact that the sum of two independent random variables will follow a normal distribution if each of the two random variables belongs to a normal population. The sample mean, as we have seen earlier, is the sum of n random variables $x_1, x_2, \ldots, x_n$, each divided by n. Now, if each of these random variables is from the same normal population, it is not difficult to see that $\bar{x}$ would also be distributed normally.
Let $x \sim N(\mu, \sigma^2)$ symbolically represent the fact that the random variable x is distributed normally with mean $\mu$ and variance $\sigma^2$. Thus, if
$$x \sim N(\mu, \sigma^2)$$
then it follows that
$$\bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
The normal distribution is a continuous distribution, so the population cannot be small and finite if it is distributed normally; that is why the finite population multiplier is not used in the above expression.
Let us see, by an example, how to make use of the above result. Suppose the weight of candy produced on a semi-automatic machine is known to be distributed normally with a mean of 10 mg and a standard deviation of 0.1 mg. If we pick a random sample of size 5, what is the probability that the sample mean will be between 9.95 mg and 10.05 mg?
Let x be a random variable representing the weight of one candy picked at random. We know that
$$x \sim N(10, 0.01)$$
Therefore, it follows that
$$\bar{x} \sim N\left(10, \frac{0.01}{5}\right)$$
This denotes that $\bar{x}$ will be distributed normally with a mean of 10 and a variance which is only 1/5 of the variance of the population, since the sample size is 5.
$$\Pr\{9.95 \le \bar{x} \le 10.05\} = 2 \times \Pr\{10 \le \bar{x} \le 10.05\}$$
$$= 2 \times \Pr\left\{\frac{10 - \mu}{\sigma/\sqrt{n}} \le \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \le \frac{10.05 - \mu}{\sigma/\sqrt{n}}\right\}$$
$$= 2 \times \Pr\left\{0 \le z \le \frac{10.05 - 10}{0.1/\sqrt{5}}\right\}$$
$$= 2 \times \Pr\{0 \le z \le 1.12\} = 2 \times 0.3686 = 0.7372$$
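The candy calculation can be checked without a z table. This is a minimal sketch using `statistics.NormalDist` from the Python standard library (available from Python 3.8); the variable names are illustrative. Because it does not round z to 1.12, the result is about 0.736, marginally lower than the table-based 0.7372.

```python
import math
from statistics import NormalDist

mu, sigma, n = 10.0, 0.1, 5          # population mean, population s.d., sample size
se = sigma / math.sqrt(n)            # standard error of the mean, 0.1 / sqrt(5)

# Sampling distribution of the sample mean: N(mu, sigma^2 / n)
xbar_dist = NormalDist(mu=mu, sigma=se)

z = (10.05 - mu) / se
prob = xbar_dist.cdf(10.05) - xbar_dist.cdf(9.95)
print("z =", round(z, 3))
print("P(9.95 <= sample mean <= 10.05) =", round(prob, 4))
```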
Figure 6: Distribution of $\bar{x}$, centred at $\mu = 10$ with $\sigma_{\bar{x}} = 0.1/\sqrt{5}$; the enclosed area represents the probability that the random variable $\bar{x}$ lies between 9.95 and 10.05
We first make use of the symmetry of the normal distribution and then calculate the z value by subtracting the mean from the value of interest and dividing the result by the standard deviation of the normally distributed random variable, viz. $\bar{x}$. The probability of interest is also shown as the enclosed area in Figure 6 above.
2.2 The Central Limit Theorem
The parameters $\mu_{\bar{x}}$ and $\sigma_{\bar{x}}$ of the sampling distribution of the mean are closely related to the parameters of the population distribution, with the relationship being described by the Central Limit Theorem. The Central Limit Theorem essentially states that the mean of the sampling distribution of the mean ($\mu_{\bar{x}}$) equals the mean of the population ($\mu_x$), that the standard error of the mean ($\sigma_{\bar{x}}$) equals the standard deviation of the population ($\sigma_x$) divided by the square root of N, and that, as the sample size gets infinitely large ($N \to \infty$), the sampling distribution of the mean approaches a normal distribution. These relationships may be summarized as follows:
$$\mu_{\bar{x}} = \mu_x \quad \text{and} \quad \sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{N}}$$
The two relationships above hold for any sample size, as derived in Section 2.1; it is the normal shape of the sampling distribution that strictly requires a very large (infinite) sample size. In theory, this is fact; in practice, an infinite sample size is impossible. In most situations encountered by researchers, the normal approximation works reasonably well with an N greater than 10 or 20, although strongly skewed populations need larger samples. Thus, it is possible to closely approximate what the distribution of sample means looks like, even with relatively small sample sizes.
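A small simulation can illustrate these relationships for a population that is clearly not normal. The sketch below assumes an exponential population with mean 1 and standard deviation 1 (an arbitrary choice made for illustration), samples of size N = 20 and 10,000 repetitions; none of these values come from the text.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

N = 20                 # sample size
repetitions = 10_000   # number of repeated samples
mu_x, sigma_x = 1.0, 1.0  # mean and s.d. of the assumed exponential population

# One sample mean per repetition, drawn from a right-skewed population.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(N)])
    for _ in range(repetitions)
]

mean_of_means = statistics.fmean(sample_means)
se_observed = statistics.stdev(sample_means)
se_theory = sigma_x / math.sqrt(N)

# Share of sample means within 1.96 standard errors of the population mean;
# a normal sampling distribution would give about 0.95.
within = sum(abs(m - mu_x) <= 1.96 * se_theory for m in sample_means) / repetitions

print(mean_of_means, se_observed, se_theory, within)
```

Even with N = 20 the mean of the sample means and their standard deviation sit close to the theoretical values, and the share within 1.96 standard errors is close to the normal figure despite the skewed population.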
The importance of the Central Limit Theorem to statistical thinking cannot be overstated. Most of hypothesis testing and sampling theory is based on this theorem. In addition, it provides a justification for using the normal curve as a model for many naturally occurring phenomena. If a trait, such as intelligence, can be thought of as a combination of relatively independent events, in this case both genetic and environmental, then it would be expected that the trait would be normally distributed in a population.
We need to use the Central Limit Theorem when the population distribution is either unknown or known to be non-normal. If the population distribution is known to be normal, then $\bar{x}$ will also be distributed normally, irrespective of the sample size.
2.3 The Sampling Distribution of the Variance
Before attempting to discuss the sampling distribution of the variance, it is worthwhile to first introduce the concept of sample variance and then present the chi-square distribution, which helps us in working out probabilities for the sample variance when the population is distributed normally.
The Sample Variance
In probability theory and statistics, the variance is a measure of how far a set of numbers is spread out. A variance of zero indicates that all the values are identical. A non-zero variance is always positive: a small variance indicates that the data points tend to be very close to the mean (expected value) and hence to each other, while a high variance indicates that the data points are very spread out from the mean and from each other.
We use the sample mean to estimate the population mean when that parameter is unknown. Similarly, we use a sample statistic called the sample variance to estimate the population variance. The sample variance is usually denoted by $s^2$ and it again captures some kind of an average of the squared deviations of the sample values from the sample mean.
Let us put it in equation form:
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$$
By comparing this expression with the corresponding expression for the population variance, we notice two differences. The deviations are measured from the sample mean and not from the population mean and, secondly, the sum of squared deviations is divided by (n - 1) and not by n. Consequently, we can calculate the sample variance based only on the sample values, without knowing the value of any population parameter. The division by (n - 1) is due to a technical reason: it makes the expected value of $s^2$ equal to $\sigma^2$, which it is supposed to estimate.
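As a short illustration of the formula, the sketch below computes the sample variance step by step and compares it with `statistics.variance` from the Python standard library, which also divides by (n - 1). The five data values are hypothetical.

```python
import statistics

sample = [4.0, 7.0, 5.0, 9.0, 10.0]  # hypothetical sample values

n = len(sample)
x_bar = sum(sample) / n                                # sample mean
sum_sq_dev = sum((x - x_bar) ** 2 for x in sample)     # sum of squared deviations

s2 = sum_sq_dev / (n - 1)        # sample variance, divisor n - 1
divided_by_n = sum_sq_dev / n    # same sum divided by n, for comparison

print(x_bar, sum_sq_dev, s2, divided_by_n)
print(statistics.variance(sample))  # library routine, divisor n - 1
```

Dividing by n instead of (n - 1) always gives the smaller figure, which is why it would underestimate $\sigma^2$ on average.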
2.4 The Chi-square Distribution
The $\chi^2$ distribution is an asymmetric distribution that has a minimum value of 0, but no maximum value. For more than two degrees of freedom the curve reaches a peak to the right of 0, and then gradually declines in height as the $\chi^2$ value increases. The curve approaches, but never quite touches, the horizontal axis.
For each degree of freedom there is a different $\chi^2$ distribution. The mean of the chi-square distribution is equal to its degrees of freedom and the variance is twice the degrees of freedom (so the standard deviation is the square root of twice the degrees of freedom). This implies that the $\chi^2$ distribution is more spread out, with a peak farther to the right, for larger than for smaller degrees of freedom. As a result, for any given level of significance, the critical region begins at a larger chi-square value, the larger the degrees of freedom.
In its graphical representation the $\chi^2$ value is on the horizontal axis, with the probability density for each $\chi^2$ value being represented on the vertical axis. The three lines in the diagram represent the pattern of chi-square for 1, 5 and 10 degrees of freedom respectively.
Figure 7: Chi-square distribution with different degrees of freedom
If the random variable x has the standard normal distribution, what would be the distribution of $x^2$? Intuitively speaking, it would be quite different from a normal distribution because $x^2$, being a squared term, can assume only non-negative values. The probability density of $x^2$ will be highest near 0, because most of the values are close to 0 in a standard normal distribution. This distribution is called the chi-square distribution with 1 degree of freedom.
The chi-square distribution has only one parameter, viz. the degrees of freedom, and so there are many chi-square distributions, each with its own degrees of freedom. In statistical tables, chi-square values for different areas under the right tail and the left tail of various chi-square distributions are tabulated.
If $x_1, x_2, \ldots, x_n$ are independent random variables, each having a standard normal distribution, then $x_1^2 + x_2^2 + \cdots + x_n^2$ will have a chi-square distribution with n degrees of freedom.
If $y_1$ and $y_2$ are independent random variables having chi-square distributions with $\gamma_1$ and $\gamma_2$ degrees of freedom, then $(y_1 + y_2)$ will have a chi-square distribution with $\gamma_1 + \gamma_2$ degrees of freedom.
Further, if $y_1$ and $y_2$ are independent random variables such that $y_1$ has a chi-square distribution with $\gamma_1$ degrees of freedom and $(y_1 + y_2)$ has a chi-square distribution with $\gamma > \gamma_1$ degrees of freedom, then $y_2$ will have a chi-square distribution with $(\gamma - \gamma_1)$ degrees of freedom.
Now, if $x_1, x_2, \ldots, x_n$ are n independent random variables from a normal population with mean $\mu$ and variance $\sigma^2$, i.e. $x_i \sim N(\mu, \sigma^2)$, $i = 1, 2, \ldots, n$, it implies that
$$\frac{x_i - \mu}{\sigma} \sim N(0, 1)$$
and so $\left(\frac{x_i - \mu}{\sigma}\right)^2$ will have a chi-square distribution with 1 degree of freedom. Hence, $\sum_{i=1}^{n}\left(\frac{x_i - \mu}{\sigma}\right)^2$ will have a chi-square distribution with n degrees of freedom.
We can break up this expression by measuring the deviations from $\bar{x}$ in place of $\mu$. We will then have
$$\sum_{i=1}^{n}\left(\frac{x_i - \mu}{\sigma}\right)^2 = \frac{1}{\sigma^2}\sum_{i=1}^{n}\left[(x_i - \bar{x}) + (\bar{x} - \mu)\right]^2$$
$$= \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \bar{x})^2 + \frac{1}{\sigma^2}\sum_{i=1}^{n}(\bar{x} - \mu)^2 + \frac{2(\bar{x} - \mu)}{\sigma^2}\sum_{i=1}^{n}(x_i - \bar{x})$$
$$= \frac{(n-1)s^2}{\sigma^2} + \left(\frac{\bar{x} - \mu}{\sigma/\sqrt{n}}\right)^2$$
since $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$ and $\sum_{i=1}^{n}(\bar{x} - \mu)^2 = n(\bar{x} - \mu)^2$.
Now, it is known that the left-hand side of the above equation is a random variable which has a chi-square distribution with n degrees of freedom. It is also known that
$$\bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
$$\therefore\; \left(\frac{\bar{x} - \mu}{\sigma/\sqrt{n}}\right)^2 \text{ will have a chi-square distribution with 1 degree of freedom.}$$
Hence, if the two terms on the right-hand side of the above equation are independent (which will be assumed as true here), then it follows that $\frac{(n-1)s^2}{\sigma^2}$ has a chi-square distribution with (n - 1) degrees of freedom. One degree of freedom is lost because the deviations are measured from $\bar{x}$ and not from $\mu$.
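The result that $(n-1)s^2/\sigma^2$ follows a chi-square distribution with (n - 1) degrees of freedom can be checked roughly by simulation, since such a distribution has mean n - 1 and variance 2(n - 1). The sketch below assumes a normal population with mean 50 and standard deviation 4, samples of size 10 and 10,000 repetitions; all of these values, and the helper name `sample_variance`, are illustrative assumptions.

```python
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable

mu, sigma, n = 50.0, 4.0, 10   # assumed normal population and sample size
repetitions = 10_000


def sample_variance(values):
    """Sample variance with divisor (n - 1)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


# One value of (n - 1) s^2 / sigma^2 per simulated sample.
scaled = []
for _ in range(repetitions):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    scaled.append((n - 1) * sample_variance(sample) / sigma ** 2)

mean_scaled = statistics.fmean(scaled)
var_scaled = statistics.variance(scaled)
print("mean:", mean_scaled, "expected:", n - 1)
print("variance:", var_scaled, "expected:", 2 * (n - 1))
```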
Expected Value and Variance of $s^2$
The mean of a chi-square distribution is equal to its degrees of freedom and the variance is equal to twice the degrees of freedom. This can be used to find the expected value and the variance of $s^2$.
Since $\frac{(n-1)s^2}{\sigma^2}$ has a chi-square distribution with (n - 1) degrees of freedom,
$$\therefore\; E\left[\frac{(n-1)s^2}{\sigma^2}\right] = n - 1$$
$$\text{or,} \quad \frac{(n-1)}{\sigma^2} \cdot E(s^2) = n - 1$$
$$\therefore\; E(s^2) = \sigma^2$$
Also,
$$Var\left[\frac{(n-1)s^2}{\sigma^2}\right] = 2(n-1)$$
Using the definition of variance, we get
$$E\left[\frac{(n-1)s^2}{\sigma^2} - E\left(\frac{(n-1)s^2}{\sigma^2}\right)\right]^2 = 2(n-1)$$
$$\text{or,} \quad E\left[\frac{(n-1)s^2}{\sigma^2} - (n-1)\right]^2 = 2(n-1)$$
$$\text{or,} \quad \frac{(n-1)^2}{\sigma^4}\,E(s^2 - \sigma^2)^2 = 2(n-1)$$
$$\therefore\; E(s^2 - \sigma^2)^2 = \frac{2\sigma^4}{n-1}, \quad \text{i.e.} \quad Var(s^2) = \frac{2\sigma^4}{n-1}$$
since the expected value of $s^2$ is equal to $\sigma^2$.
It can, therefore, be concluded that if we take a large number of samples, each of size n, from a normal population with mean $\mu$ and variance $\sigma^2$, each sample will perhaps have a different value for its sample variance $s^2$, but the average of a large number of values of $s^2$ will be close to $\sigma^2$. Also, the variance of $s^2$ falls as the sample size increases.
It is important to note here that the chi-square result and the variance formula above are based on the assumption that the population is distributed normally. If the population does not have a normal distribution, they need not hold, although $E(s^2) = \sigma^2$ remains true for any population with a finite variance.
2.5 Sampling Distribution of the Proportion
Let us assume that 0.80 of all students in a school can pass a test of physical fitness. A random sample of 20 students is chosen: 13 passed and 7 failed. The parameter π is used to designate the proportion of subjects in the population that pass (0.80 in this case) and the statistic p is used to designate the proportion who pass in a sample (13/20 = 0.65 in this case). The sample size (N) in this example is 20. If repeated samples of size N were taken from the population and the proportion passing (p) were determined for each sample, a distribution of values of p would be formed.
If the sampling went on forever, the distribution would be the sampling distribution of a proportion. The sampling distribution of a proportion follows from the binomial distribution, since the number passing in a sample of size N is a binomial random variable and p is that number divided by N. The mean and standard deviation of the sampling distribution of p are:
$$\mu_p = \pi \quad \text{and} \quad \sigma_p = \sqrt{\frac{\pi(1-\pi)}{N}}$$
For the present example, N = 20 and π = 0.80, so the mean of the sampling distribution of p ($\mu_p$) is 0.8 and the standard error of p ($\sigma_p$) is 0.089. The shape of the binomial distribution depends on both N and
π. With large values of N and values of π in the neighborhood of 0.5, the sampling distribution is very close to a normal distribution.

Assume that for the population of people applying for a job at a bank in a major city, 0.40 are able to pass a basic literacy test required to get the job. Out of a group of 20 applicants, what is the probability that 50% or more of them will pass? This problem involves the sampling distribution of p with π = 0.40 and N = 20. The mean of the sampling distribution is π = 0.40. The standard deviation is:

$$\sigma_p = \sqrt{\frac{\pi(1-\pi)}{N}} = \sqrt{\frac{0.40(1-0.40)}{20}} = 0.11$$

Using the normal approximation, a proportion of 0.50 is (0.50 − 0.40)/0.11 = 0.909 standard deviations above the mean. From a z table, 0.818 of the area lies below z = 0.909. Therefore the probability that 50% or more will pass the literacy test is only about 1 − 0.818 = 0.182.

2.6 The Confidence Level

The sample mean is the researcher's estimate of the population mean. If we are asked to give an interval as our estimate, we add a range on the upper and the lower side of the sample mean and give that interval as our estimate. The larger the interval, the greater is our confidence that the interval contains the true population mean. Note that the true population mean is a constant, not a variable. The interval that we specify, on the other hand, is a random interval whose position depends on the sample mean.

For example, if the sample mean is 50 and the standard error of the mean is 5, we may specify our interval estimate as (45, 55), i.e. from 45 to 55, which spans one standard error of the mean on either side of the sample mean. If instead the interval estimate is specified as (40, 60), i.e. spanning two standard errors of the mean on either side of the sample mean, we are more confident that this interval contains the true population mean than we are for the first one.
However, if the confidence level is raised too high, the corresponding interval may become too wide to be of any practical use. The confidence level, therefore, may be defined as the probability that the interval estimate will contain the true value of the population parameter that is being estimated. If we say that a 95% confidence interval for the population mean is obtained by spanning 1.96 times the standard error of the mean on either side of the sample mean, we mean that if we took a large number of samples of size n, say 1000, and obtained the interval estimate from each of these 1000 samples, then about 95% of these interval estimates would contain the true population mean.

Confidence Interval for the Population Mean

Let us now discuss how to obtain a confidence interval for the population mean. We shall assume that the population distribution is normal and that the population variance is known. Later, we shall relax the second condition.

Suppose it is known that the weight of cement in packed bags is distributed normally with a standard deviation of 0.2 kg. A sample of 25 bags is picked at random, and the mean weight of cement in
these 25 bags is only 49.7 kg. We want to find a 90% confidence interval for the mean weight of cement in filled bags.

Let x be a random variable representing the weight of cement in a bag picked at random. We know that x is distributed normally with a standard deviation of 0.2 kg. The standard error of the mean can be easily calculated as

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{0.2}{\sqrt{25}} = 0.04 \text{ kg}$$

Since 90% of the area under the standard normal curve lies between z = −1.645 and z = +1.645, the 90% confidence interval is 49.7 ± 1.645 × 0.04, i.e. from 49.6342 kg to 49.7658 kg.

We can use the above approach when the population standard deviation is known, or when the sample size is large (n > 30), in which case the sample standard deviation can be used as an estimate of the population standard deviation. However, if the population standard deviation is not known and the sample size is not large, as with n = 25 in the example above, then one has to use the t distribution in place of the standard normal distribution to calculate the probabilities.

Let us assume that we are interested in developing a 90% confidence interval in the same situation as described earlier, with the difference that the population standard deviation is now not known. However, the sample standard deviation has been calculated and is known to be 0.2 kg. Since the sample size is n = 25, we know that

$$\frac{\bar{x}-\mu}{s/\sqrt{n}}$$

follows a t-distribution with 24 degrees of freedom. From t-tables, we can see that the probability that a t statistic with 24 degrees of freedom lies between −1.711 and +1.711 is 0.90, i.e. the probability that $\bar{x}-\mu$ lies between $-1.711\,s/\sqrt{n}$ and $+1.711\,s/\sqrt{n}$ is 0.90. In other words, if we use an interval spanning from $\left(\bar{x} - 1.711\,s/\sqrt{n}\right)$ to $\left(\bar{x} + 1.711\,s/\sqrt{n}\right)$, then 90% of the time this interval will contain $\mu$.

Hence, for a 90% confidence interval,

$$\text{lower limit} = \bar{x} - 1.711\,\frac{s}{\sqrt{n}} = 49.7 - 1.711 \times \frac{0.2}{\sqrt{25}} = 49.6316$$

$$\text{upper limit} = \bar{x} + 1.711\,\frac{s}{\sqrt{n}} = 49.7 + 1.711 \times \frac{0.2}{\sqrt{25}} = 49.7684$$

In this case, we can state with a 90% confidence level that the mean weight of cement in a filled bag lies between 49.6316 kg and 49.7684 kg.
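The arithmetic of the cement-bag example can be checked with a short script. This is a minimal sketch, not part of the original text: the helper name `confidence_interval` is hypothetical, the t critical value 1.711 is simply copied from the t-table figure quoted above rather than computed, and the normal critical value is obtained from Python's standard-library `statistics.NormalDist`.

```python
import math
from statistics import NormalDist


def confidence_interval(sample_mean, std_dev, n, critical_value):
    """Return (lower, upper) = sample_mean -/+ critical_value * std_dev / sqrt(n)."""
    standard_error = std_dev / math.sqrt(n)
    margin = critical_value * standard_error
    return sample_mean - margin, sample_mean + margin


# Values from the cement-bag example.
x_bar, std_dev, n = 49.7, 0.2, 25

# Case 1: population standard deviation known -> standard normal critical value.
# A 90% two-sided interval leaves 5% in each tail, so we need the 95th percentile.
z_crit = NormalDist().inv_cdf(0.95)
z_low, z_high = confidence_interval(x_bar, std_dev, n, z_crit)

# Case 2: population standard deviation unknown, n = 25 -> t with 24 degrees
# of freedom. The value 1.711 is taken from the t-table quoted in the text.
T_CRIT_24_DF_90 = 1.711
t_low, t_high = confidence_interval(x_bar, std_dev, n, T_CRIT_24_DF_90)

print(f"z-based 90% interval: ({z_low:.4f}, {z_high:.4f})")
print(f"t-based 90% interval: ({t_low:.4f}, {t_high:.4f})")
```

The t-based interval comes out slightly wider than the z-based one, which reflects the extra uncertainty from estimating the standard deviation from a small sample.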
Using these derivations and relations, we can calculate the sample size that is appropriate for a particular study at a desired confidence level.
***