BASIC OF STATISTICAL INFERENCE PART-I
www.dexlabanalytics.com
CLASSICAL SAMPLING THEORY
Copyright © 2020, DexLab Solutions Corporation
BASIC BUILDING BLOCKS OF CLASSICAL SAMPLING THEORY
POPULATION
The complete set of observations that are eligible to be
studied to address the key questions of a statistical
investigation, irrespective of whether they can be accessed or
not. In practice the entire population is always partly unknown,
since there is a part of the population which cannot be
accessed for reasons such as data archiving problems, data
permissions, etc.
SAMPLE
A representative subset of the population is called a sample.
The distribution of the variables in the sample is used to form
an idea about the distribution of the same variables in the
population. For example, in a credit card portfolio, the average
utilisation in the development sample and in the validation
sample must each be a rough approximation of the average
utilisation in the population (i.e. Development + Validation).
WHY IS A MODEL ALWAYS DEVELOPED ON A SAMPLE EVEN WHEN ALL THE DATA IS AVAILABLE?
The population is never fully accessible to the developers, even
if they have all the master files required for model
development. The reasons are:
1. Observation Exclusions: Not all observations are needed for
developing a model, because some are not relevant for the
development.
2. Variable Exclusions: Not all variables are needed for
developing a model, because some are irrelevant for the
development.
For model development, two-fold validation is required to
ensure model robustness:
1. In-sample validation: if the model is applied to data from
the same period but containing mutually exclusive accounts, how
does it perform?
2. Out-of-time validation: if the model is applied to data from
a mutually exclusive time period, how does it perform?
These validation samples are created from the population.
Hence, the full population is never accessible for development.
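The two-fold validation described above can be sketched roughly as follows. This is only an illustration: the account records, the `period` field, the cut-off period and the 30% validation share are all hypothetical assumptions, not part of the original material.

```python
import random

def split_samples(accounts, cutoff_period, validation_share=0.3, seed=0):
    """Split records into development, in-sample validation and out-of-time samples.

    The field name "period" and the default share are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Out-of-time sample: records from a mutually exclusive (later) time period.
    in_time = [a for a in accounts if a["period"] <= cutoff_period]
    out_of_time = [a for a in accounts if a["period"] > cutoff_period]
    # In-sample validation: same period, but accounts not used for development.
    rng.shuffle(in_time)
    n_val = int(len(in_time) * validation_share)
    validation = in_time[:n_val]
    development = in_time[n_val:]
    return development, validation, out_of_time

# Hypothetical data: 30 accounts spread over the periods 2018, 2019 and 2020.
accounts = [{"id": i, "period": 2018 + i % 3} for i in range(30)]
dev, val, oot = split_samples(accounts, cutoff_period=2019)
print(len(dev), len(val), len(oot))
```

Because each of the three samples is carved out of the same population, none of them is the full population, which is the point made above.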
Introduction
Types of Sampling Methods
RANDOM SAMPLING
A sampling technique in which each observation has an equal probability
of being chosen.
PURPOSIVE SAMPLING
A sample which is selected on the basis of the individual judgement of
the sampler.
SIMPLE RANDOM SAMPLING
The selection of a group of units in such a manner that every unit of
the population has an “equal chance” of being included in the sample.
STRATIFIED SAMPLING
The population is subdivided into several categories, called strata, and
a sub-sample is chosen from each of them. All the sub-samples combined
give the stratified sample. If the selection from the strata is done by
random sampling, the method is called stratified random sampling.
SIMPLE RANDOM SAMPLING WITH REPLACEMENT
The n members of the sample are drawn from the population one by one, each drawn
member being returned before the next draw, so that at each drawing every one of the
N members of the population has the same probability of being selected.
SIMPLE RANDOM SAMPLING WITHOUT REPLACEMENT
The n members of the sample are drawn one by one, but members once drawn are not
returned to the population, and at each stage every remaining member of the
population has the same probability of being included in the sample.
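The difference between drawing with and without replacement can be illustrated with Python's standard `random` module. This is a minimal sketch: the population of ten numbered units, the sample size and the seed are arbitrary assumptions.

```python
import random

population = list(range(1, 11))  # hypothetical population of N = 10 units
n = 4                            # sample size
rng = random.Random(42)          # fixed seed so the run is repeatable

# With replacement: a unit may be drawn more than once.
with_replacement = rng.choices(population, k=n)

# Without replacement: every drawn unit is distinct.
without_replacement = rng.sample(population, k=n)

print(with_replacement)
print(without_replacement)
```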
Examples of Types of Sampling Methods
RANDOM SAMPLING
Random sampling is used when you do not want to give special weight to any specific
observation: all observations are treated as equal and the sample is simply drawn at random.
When developing a risk model, for example, you have to be impartial. You cannot select records
in arbitrary or purposive ways, because that may bias the amount of capital that is held. In a
domain like risk, where capital has to be assigned impartially to ensure the solvency of banks
that serve many people from different groups, the sampling should be done independently of
any such judgement.
PURPOSIVE SAMPLING
Purposive sampling is used when the research problem affects only a specific class of
people. Say you want to test the effectiveness of a drug on diabetes. Then you need to
pull out of the population those observations that have the disease as specified, i.e. a
subset of patients who satisfy certain required parameters. That is a case where the
selection is done purposively.
SIMPLE RANDOM SAMPLING
In a simple random sampling from the overall data (whatever record present in data) for those
records sample are chosen randomly. However it is a very possible case to represent a particular
portfolio there can be certain matrices which are must appear in respective samples as they
appear in the population. Let’s say there can be certain product distributions in the portfolio
which need to be capture the sample level. So, there can be multiple different categorical
variables which needed to captured in different ratio as the sample level have in the population.
STRATIFIED RANDOM SAMPLING
A company interested in branching into a new market might, for a market survey, choose a
population of people using similar products, stratify it by brand and sample from each
stratum.
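The market-survey example can be sketched as below. The customer records, the brand labels and the 10% sampling fraction are hypothetical, and drawing the same fraction from every stratum is only one possible allocation rule.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Draw a simple random sample of the given fraction from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[record[key]].append(record)   # group records by stratum
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one unit per stratum
        sample.extend(rng.sample(members, k))       # random sub-sample, no repeats
    return sample

# Hypothetical population: 60 users of brand A, 30 of brand B, 10 of brand C.
customers = [
    {"id": i, "brand": "A" if i < 60 else "B" if i < 90 else "C"}
    for i in range(100)
]
sample = stratified_sample(customers, key="brand", fraction=0.1)
print(len(sample))
```

With this allocation the brand mix in the sample mirrors the brand mix in the population, which is the property the simple random sampling example above says a plain random draw cannot guarantee.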
SIMPLE RANDOM SAMPLING WITH REPLACEMENT
Consider a lot of 5 watches, 3 good and 2 defective. If two watches are selected one after
another with replacement, there are 5 × 5 = 25 possible ordered samples. The selected sample
will be one of these 25, and each has an equal probability of 1/25 of being selected.
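The count of 25 can be checked by listing every ordered pair. In this small sketch the watch labels (G for good, D for defective) are made up for illustration.

```python
from fractions import Fraction
from itertools import product

watches = ["G1", "G2", "G3", "D1", "D2"]  # 3 good, 2 defective (labels are hypothetical)

# With replacement, the same watch may appear in both draws, and order matters.
ordered_samples = list(product(watches, repeat=2))
prob_each = Fraction(1, len(ordered_samples))

print(len(ordered_samples), prob_each)  # 25 1/25
```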
SIMPLE RANDOM SAMPLING WITHOUT REPLACEMENT
Consider a lot (population) of 5 watches, 3 good and 2 defective. Suppose two watches are
to be selected without replacement and without regard to order. There are 10 possible
combinations, and each of them has a probability of selection equal to 1/10.
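The 10 combinations can be listed in the same way. Again the watch labels are hypothetical and the sketch only confirms the count.

```python
from fractions import Fraction
from itertools import combinations

watches = ["G1", "G2", "G3", "D1", "D2"]  # 3 good, 2 defective (labels are hypothetical)

# Without replacement and ignoring order: each pair of distinct watches counts once.
unordered_samples = list(combinations(watches, 2))
prob_each = Fraction(1, len(unordered_samples))

print(len(unordered_samples), prob_each)  # 10 1/10
```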
Sampling Distribution
SAMPLING FLUCTUATION
Each sample drawn from a given population includes a
different set of population members. Therefore, the value
of the estimator is liable to vary from one sample to
another. These differences in the values of the estimator
are called the sampling fluctuations of the estimator.
The sampling distribution of a statistic may be defined as the probability law which the statistic
follows if repeated random samples of a fixed size are drawn from a specified population. If a number
of samples, each of size n, are taken from the same population and the value of the statistic is
calculated for each sample, a series of values of the statistic is obtained. If the number of samples
is large, these may be arranged into a frequency table. The frequency distribution of the statistic that
would be obtained if the number of samples, each of the same size (say n), were infinite is called the
sampling distribution of the statistic.
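The repeated-sampling idea can be approximated by simulation. In the sketch below the five-value population, the sample size and the number of repetitions are arbitrary assumptions, and a finite number of repetitions only approximates the sampling distribution.

```python
import random
from collections import Counter
from statistics import mean

rng = random.Random(1)
population = [2, 4, 6, 8, 10]  # hypothetical population
n = 2                          # size of each sample
num_samples = 10_000           # number of repeated samples

# Draw many samples of size n (with replacement) and record each sample mean.
means = [mean(rng.choices(population, k=n)) for _ in range(num_samples)]

# Arrange the values of the statistic into a relative frequency table.
freq = Counter(means)
for value in sorted(freq):
    print(value, freq[value] / num_samples)
```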
Sampling distribution of sample mean
If x bar represents the mean of a random sample
of size n, drawn from a population with mean µ
and standard deviation σ, then the sampling
distribution of x bar is approximately a normal
distribution with Mean = µ and Standard
deviation = standard error of x bar = σ/√n,
provided the sample size n is sufficiently large.
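As a rough check of this statement, the simulation below compares the spread of simulated sample means with σ/√n, the usual standard error of the mean under sampling with replacement. The uniform population, the sample size and the number of repetitions are assumptions chosen only for illustration.

```python
import random
from math import sqrt
from statistics import mean, pstdev

rng = random.Random(7)
# Hypothetical population of 5,000 values.
population = [rng.uniform(0, 100) for _ in range(5_000)]
mu, sigma = mean(population), pstdev(population)

n = 50  # sample size
# Means of 2,000 samples drawn with replacement.
sample_means = [mean(rng.choices(population, k=n)) for _ in range(2_000)]

theoretical_se = sigma / sqrt(n)     # standard error of x bar
empirical_se = pstdev(sample_means)  # spread of the simulated sample means

print(round(mu, 2), round(mean(sample_means), 2))
print(round(theoretical_se, 3), round(empirical_se, 3))
```

The two printed pairs should be close to each other, though not identical, since the simulation uses a finite number of samples.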
Sampling distribution of sample proportion
If p represents the proportion of
defectives in a random sample of size n
drawn from a lot with proportion of
defectives P, then the sampling
distribution of p is approximately a
normal distribution with
Mean = P and Standard deviation =
standard error of p = √(P(1 − P)/n),
provided the sample size n is sufficiently
large.
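The same kind of check can be made for a sample proportion, using √(P(1 − P)/n) as the standard error. The values P = 0.2 and n = 100 and the helper name `se_proportion` are hypothetical choices for this sketch.

```python
import random
from math import sqrt
from statistics import mean, pstdev

def se_proportion(P, n):
    """Standard error of a sample proportion under random sampling."""
    return sqrt(P * (1 - P) / n)

rng = random.Random(11)
P, n = 0.2, 100  # hypothetical lot proportion of defectives and sample size

# Proportion of defectives in each of 3,000 simulated samples of size n.
sample_props = [
    sum(rng.random() < P for _ in range(n)) / n
    for _ in range(3_000)
]

print(round(mean(sample_props), 4), P)
print(round(pstdev(sample_props), 4), round(se_proportion(P, n), 4))
```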
The standard error of a statistic is the standard deviation calculated from the sampling distribution of the
statistic. A sampling distribution has its own mean, standard deviation and moments of higher orders. Of particular
importance is the standard deviation, which is designated the standard error of the statistic. The mean of a
statistic will generally be the corresponding parameter, exactly or approximately. The standard error of the
statistic then gives an idea of the average error that one would commit in using the value of the statistic in
lieu of the parameter. This is illustrated, for the case of random sampling, by the means and standard errors of a
sample mean and a sample proportion. Some people prefer to use 0.6745 times the standard error, which is called
the probable error of the statistic. The relevance of the probable error stems from the fact that for a normally
distributed variable x with mean µ and standard deviation σ,
P[µ - 0.6745σ <= x <= µ + 0.6745σ] = 0.50, approximately.
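The constant 0.6745 can be verified with `statistics.NormalDist` from the Python standard library. It is (approximately) the upper quartile of the standard normal distribution, so half of the probability lies within that distance of the mean.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Probability that a normal variable lies within 0.6745 standard deviations of its mean.
prob = z.cdf(0.6745) - z.cdf(-0.6745)

# The 75th percentile of the standard normal distribution.
upper_quartile = z.inv_cdf(0.75)

print(round(prob, 4), round(upper_quartile, 4))
```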
PARAMETER
A parameter is an unknown
numerical characteristic of the
population. The primary interest of
any survey lies in knowing the
values of different measures of the
population distribution of a
variable of interest. These measures
include its mean, standard
deviation, etc., which are calculated
on the basis of the population
values of the variable. In other
words, a parameter is a function of
all the population units.
ESTIMATOR
An estimator is a measure
computed on the basis of sample
values. It is a function of all the
sample observations, providing a
representative value of the
collected sample.
STATISTIC
Any statistical measure calculated
on the basis of sample
observations is called a statistic,
for example the sample mean, the
sample standard deviation, etc.
Sample statistics are always known
to us.
Serial No.   Mean       Standard Deviation   Frequency   Relative Frequency
Sample01     4456.25    3288.622             1           5.00%
Sample02     4617.25    2566.8               2           10.00%
Sample03     2006.25    754.08               1           5.00%
Sample04     4617.25    2566.8               1           5.00%
Sample05     2006.25    754.08               2           10.00%
Sample06     5553.75    4924.163             1           5.00%
Sample07     2006.25    754.08               3           15.00%
Sample08     4617.25    2566.8               3           15.00%
Sample09     2006.25    754.08               4           20.00%
Sample10     5553.75    4924.163             2           10.00%
Total        37440.5    23853.67             20          100.00%
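The relative frequencies in the table are each sample's frequency divided by the total frequency of 20. The sketch below recomputes them from the tabulated values. Pooling the frequencies of samples that share the same mean is an extra step assumed here for illustration, not something stated in the original.

```python
from collections import defaultdict

# (serial no., mean, frequency) copied from the table above.
rows = [
    ("Sample01", 4456.25, 1), ("Sample02", 4617.25, 2),
    ("Sample03", 2006.25, 1), ("Sample04", 4617.25, 1),
    ("Sample05", 2006.25, 2), ("Sample06", 5553.75, 1),
    ("Sample07", 2006.25, 3), ("Sample08", 4617.25, 3),
    ("Sample09", 2006.25, 4), ("Sample10", 5553.75, 2),
]

total = sum(freq for _, _, freq in rows)

# Relative frequency of each row, as in the last column of the table.
relative = {name: freq / total for name, _, freq in rows}

# Assumed extra step: pool the frequencies of rows that share the same mean.
by_mean = defaultdict(int)
for _, sample_mean, freq in rows:
    by_mean[sample_mean] += freq

for sample_mean in sorted(by_mean):
    print(sample_mean, f"{by_mean[sample_mean] / total:.2%}")
```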
[Chart: Relative Frequency (axis from 0.00% to 25.00%) plotted against Mean]
CONTACT US
K 3/5, DLF Phase 2, Gurgaon, Haryana – 122 002.
hello@dexlabanalytics.com
+91 124 450 2444; +91 124 488 8144
+91 931 572 5902; +91 8527 872 444
www.dexlabanalytics.com
