DATA SAMPLING
• Data sampling is a statistical analysis technique used to select,
manipulate and analyze a representative subset of data points to
identify patterns and trends in the larger data set being examined. It
enables data scientists, predictive modelers and other data analysts to
work with a small, manageable amount of data about a
statistical population to build and run analytical models more quickly,
while still producing accurate findings.
• Sampling can be particularly useful with data sets that are too large to
efficiently analyze in full, for example in big data analytics
applications or surveys. Identifying and analyzing a
representative sample is more efficient and cost-effective than
surveying the entirety of the data or population.
Populations and Samples
• Population: A population is the group of elements that share
common characteristics. It is the collection of observations
about which we would like to make inferences.
• Sample: A sample is a subset of the population.
• Sampling: Sampling is the process of selecting a sample from
the population. Sampling units are non-overlapping
collections of elements from the population.
• An important consideration, though, is the size of the required
data sample and the possibility of introducing a sampling error. In
some cases, a small sample can reveal the most important
information about a data set. In others, using a larger sample can
increase the likelihood of accurately representing the data as a
whole, even though the increased size of the sample may impede
ease of manipulation and interpretation.
Sampling Error
• Sampling error is the deviation between an estimate computed from a
sample and the true value for the population.
• The core assumption of data sampling is that the sample is a
representative subset of the population, so the sample mean is
approximately equal to the mean of the population.
• The degree to which that does not hold is the sampling error.
• We can reduce sampling error by following sampling best
practices, like having a large enough sample size, choosing the
right kind of sampling, and avoiding sampling bias.
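
The link between sample size and sampling error can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is synthetic (normally distributed values), and the function names sampling_error and average_abs_error are hypothetical, chosen for this example.

```python
import random
import statistics

def sampling_error(population, sample_size, rng):
    """Sample mean minus population mean for one simple random sample."""
    sample = rng.sample(population, sample_size)
    return statistics.fmean(sample) - statistics.fmean(population)

def average_abs_error(population, sample_size, trials, rng):
    """Average absolute sampling error over repeated random samples."""
    errors = [abs(sampling_error(population, sample_size, rng))
              for _ in range(trials)]
    return statistics.fmean(errors)

rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical population: 10,000 values centred on 50 (an assumption).
population = [rng.gauss(50, 10) for _ in range(10_000)]

for n in (10, 100, 1000):
    print(n, round(average_abs_error(population, n, 200, rng), 3))
```

In a run like this, the average error should shrink as the sample size grows, which is the reason a "large enough" sample is listed among the best practices.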
Data Sampling Methods
When taking a sample from a larger population, you must
make sure that the sample is of an appropriate size and
free from bias.
There are two types of sampling:
• Probability sampling
• Non-probability sampling
Probability Sampling:
Every element in the population has a known, non-zero chance of
being selected; in the simplest designs that chance is equal for all
elements. A sampling method is biased if some members of the
population are systematically more or less likely to be in the
sample than the design intends.
Different types of probability sampling
• Simple random sampling
• Stratified sampling
• Systematic sampling
• Cluster sampling
Simple random sampling:
• It is a method of sampling in which every element of the
universe has an equal probability of being chosen. For example,
choosing an individual by lottery. The advantage of this
method is that it is free from personal bias, and the universe gets
fairly represented by the samples.
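
A lottery-style draw can be sketched with Python's standard random module. The function name simple_random_sample and the ticket data are hypothetical, introduced only for illustration.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k distinct elements; each element is equally likely to be chosen."""
    rng = random.Random(seed)  # seed is optional, used here for repeatability
    return rng.sample(population, k)

# Hypothetical lottery: tickets numbered 1 to 100.
tickets = list(range(1, 101))
winners = simple_random_sample(tickets, 5, seed=42)
print(winners)
```

Because random.sample draws without replacement, no ticket can be picked twice.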
Stratified sampling:
• The population is broken down into non-overlapping groups, called
strata, such that elements within each stratum are relatively
homogeneous. Random samples are then taken from each stratum, so
that the entire population gets represented. The advantage of this method is
that it covers all the subgroups of the population. But there is a possibility of
bias at the time of classification of the population.
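
One possible way to implement this is proportional allocation, where the same fraction is drawn from every stratum. The sketch below assumes that allocation rule (the text does not specify one); stratified_sample, stratum_of and the urban/rural data are hypothetical names for this example.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, fraction, seed=None):
    """Group records into strata, then randomly sample a fraction of each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)

    sample = []
    for members in strata.values():
        # At least one element per stratum, never more than the stratum holds.
        k = min(len(members), max(1, round(len(members) * fraction)))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 80 urban and 20 rural respondents.
people = [("urban", i) for i in range(80)] + [("rural", i) for i in range(20)]
sample = stratified_sample(people, stratum_of=lambda p: p[0], fraction=0.1, seed=1)
print(sample)
```

With a fraction of 0.1, this draws 8 urban and 2 rural respondents, so both strata appear in the sample in proportion to their size.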
Systematic sampling:
• Samples are selected from the population according to a
predetermined rule. In other words, every nth element is selected from
the population as a sample. Arrange all the elements in a
sequence and then select the samples from the population at
regular intervals.
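
The "every nth element" rule maps directly onto list slicing. In this sketch the starting point is chosen at random within the first interval, which is a common convention but an assumption here; systematic_sample is a hypothetical name.

```python
import random

def systematic_sample(ordered_items, step, seed=None):
    """Pick a random start in the first interval, then take every step-th item."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    rng = random.Random(seed)
    start = rng.randrange(step)  # random offset in 0 .. step-1
    return ordered_items[start::step]

# Hypothetical ordered list of 100 customer IDs; select every 10th one.
customer_ids = list(range(100))
print(systematic_sample(customer_ids, step=10, seed=3))
```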
Cluster sampling:
• The population is broken down into many different clusters, and
then whole clusters or subgroups are randomly selected to form the
sample. For example, clusters may be formed by age group, sex or location.
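
A minimal sketch of one-stage cluster sampling follows, assuming the clusters are already known and every member of a selected cluster is kept. The function name cluster_sample and the city data are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly select whole clusters and keep all of their members.

    clusters: dict mapping a cluster name to the list of its members.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # pick cluster names
    return {name: clusters[name] for name in chosen}

# Hypothetical clusters formed by location.
by_city = {
    "north": ["n1", "n2", "n3"],
    "south": ["s1", "s2"],
    "east": ["e1", "e2", "e3", "e4"],
    "west": ["w1"],
}
print(cluster_sample(by_city, n_clusters=2, seed=7))
```

Note the contrast with stratified sampling: here the random choice is made among the groups themselves, not among the elements inside each group.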
Different types of non-probability sampling
• Purposive sampling
• Convenience sampling
• Quota sampling
• Snowball/referral sampling
Purposive sampling:
• Purposive sampling is also known as judgment sampling. Samples are
selected based on the purpose or intention of research. The method is
flexible to allow the inclusion of those items in the sample which are
of special significance.
Convenience sampling:
• Convenience sampling is one of the easiest sampling methods. Sample
selection is based on availability: the researcher selects the samples
that are most convenient to reach.
Quota sampling:
• It is a non-probability counterpart of stratified sampling, where
samples are collected in each subgroup until the desired quota is met.
Selection within each subgroup is not random, and the proportions in
the sample need not match the proportion of each group in
the population.
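
Filling quotas in arrival order can be sketched as below. The example assumes respondents are taken first come, first served, which is one common way quota samples are collected; quota_sample, group_of and the respondent data are hypothetical.

```python
def quota_sample(stream, group_of, quotas):
    """Accept items in arrival order until every group's quota is filled."""
    remaining = dict(quotas)  # copy so the caller's quotas are not modified
    sample = []
    for item in stream:
        group = group_of(item)
        if remaining.get(group, 0) > 0:
            sample.append(item)
            remaining[group] -= 1
        if not any(remaining.values()):
            break  # all quotas met, stop collecting
    return sample

# Hypothetical respondents arriving in order, tagged by group.
respondents = [("F", 1), ("F", 2), ("M", 3), ("F", 4), ("M", 5), ("M", 6)]
print(quota_sample(respondents, group_of=lambda r: r[0], quotas={"F": 2, "M": 2}))
```

Nothing in this procedure is random, which is why quota sampling is classed as a non-probability method even though it resembles stratification.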
Snowball/referral sampling:
• Snowball sampling, or referral sampling, is a method popular in
medical and social science surveys where the population is unknown
and samples are difficult to obtain. Researchers therefore ask
existing participants to refer others who fit the population.
Since it is based on referrals, there is a chance of bias.
Kinds of Sampling Bias
Sampling bias is a bias in which samples are collected in such a
way that some elements of the intended population have less or
more sampling probability than the others.
The following are the different types of sampling bias:
• Response Bias: A response or data bias is a systematic bias that
occurs during data collection and influences the response.
• Voluntary Response Bias: Occurs when individuals can choose whether
to participate.
• Non-response Bias: Non-response bias occurs when units
selected as part of the sampling procedure do not respond, in
whole or in part.
• Convenience Bias: Occurs when the sample is taken from individuals
that are conveniently available.
