This article provides the basics of the statistical techniques of sampling and sampling distributions. It is useful for students and scholars involved in research work in the field of humanities.
Data Collection tools: Questionnaire vs Schedule (Amit Uraon)
Questionnaires and schedules are commonly used methods for collecting primary data. Questionnaires involve sending a standardized set of questions to respondents to answer on their own and return. Schedules are similar but involve an enumerator personally collecting responses by asking questions directly and filling out the schedule. Both methods can be used for descriptive or explanatory research and involve designing valid and reliable questions, representative sampling, and defining relationships between variables. Questionnaires are cheaper but have higher non-response rates, while schedules provide more complete information through personal contact but are more expensive because they require field workers.
Data processing involves 5 key steps: editing data, coding data, classifying data, tabulating data, and creating data diagrams. It transforms raw collected data into a usable format through these steps of cleaning, organizing, and analyzing the data. First, data is collected from sources and prepared by cleaning errors. It is then inputted and processed using algorithms before being output and interpreted in readable formats. Finally, the processed data is stored for future use and reports.
This document discusses various measures of dispersion used to quantify how spread out or clustered data values are around a central tendency. It defines key terms like range, variance, standard deviation, and coefficient of variation. Examples are provided to demonstrate how to calculate these measures for both individual and grouped data. The normal distribution curve is also discussed to show how dispersion relates to the percentage of values that fall within a given number of standard deviations from the mean.
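As a concrete companion to that summary, here is a minimal sketch in plain Python (the data values are illustrative, not from the document) computing the measures it names for individual data:

```python
import statistics

data = [12, 15, 11, 18, 14, 20, 16]  # hypothetical individual observations

data_range = max(data) - min(data)      # range: highest value minus lowest
mean = statistics.mean(data)
variance = statistics.pvariance(data)   # population variance: mean squared deviation
std_dev = statistics.pstdev(data)       # standard deviation: square root of variance
cv = std_dev / mean * 100               # coefficient of variation, as a percentage

print(f"range={data_range}, mean={mean:.2f}, variance={variance:.2f}, "
      f"sd={std_dev:.2f}, CV={cv:.1f}%")
```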
A fully detailed treatment of editing, coding, and tabulation of data in research work.
The editing, coding, and tabulation of data are explained in this presentation.
Multistage sampling is a complex form of cluster sampling that uses multiple sampling methods together in stages. It first divides the population into primary sampling units and randomly selects some of these units. The selected units are then divided into secondary sampling units where another random sample is selected. This process can continue for third and fourth stages if needed. Multistage sampling is commonly used in large surveys to efficiently select samples across geographical areas in multiple stages.
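The staged selection described above can be sketched in a few lines of Python. The frame below (districts containing households) and all the sizes are hypothetical:

```python
import random

random.seed(1)

# Hypothetical two-stage frame: districts are the primary sampling units,
# households within them are the secondary sampling units.
districts = {f"district_{d}": [f"hh_{d}_{h}" for h in range(100)]
             for d in range(20)}

# Stage 1: randomly select a few primary sampling units.
selected_districts = random.sample(list(districts), k=4)

# Stage 2: within each selected district, randomly select secondary units.
sample = []
for d in selected_districts:
    sample.extend(random.sample(districts[d], k=10))

print(len(sample), "households sampled from", selected_districts)
```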
This document discusses the Z-test, a statistical test used to compare means and proportions. The Z-test can be used to test if a sample mean differs from a population mean, if two sample means are equal, or if two population proportions are equal. It assumes the population is normally distributed. The steps involve formulating hypotheses, choosing a significance level, calculating the Z-statistic, and comparing it to a critical value to determine if the null hypothesis should be rejected or accepted. The Z-test is useful when sample sizes are large but requires knowing the population standard deviation.
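A minimal sketch of the one-sample case described above, using Python's standard library; the hypothesized mean, known population standard deviation, and sample figures are invented for illustration:

```python
from statistics import NormalDist

# Hypothetical inputs: population sd assumed known, large sample.
pop_mean, pop_sd = 50.0, 8.0     # H0: mu = 50
sample_mean, n = 52.1, 100
alpha = 0.05

z = (sample_mean - pop_mean) / (pop_sd / n ** 0.5)  # Z-statistic
z_crit = NormalDist().inv_cdf(1 - alpha / 2)        # two-sided critical value

print(f"z = {z:.2f}, critical value = +/-{z_crit:.2f}")
print("reject H0" if abs(z) > z_crit else "fail to reject H0")
```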
An introduction to and characteristics of sampling, types of sampling, and errors (Gunjan Verma)
This document discusses sampling methods used in research. It defines key terms like population, sample, sampling units and strategies. The main types of sampling discussed are probability sampling which uses random selection, and non-probability sampling which does not. Specific probability methods covered include simple random sampling, systematic random sampling, stratified random sampling and cluster sampling. Non-probability methods discussed are convenience sampling, purposive sampling, quota sampling, and snowball sampling. The document also addresses sample size determination, sources of error in sampling like sampling error and non-sampling error, and concludes with advantages of sampling.
This document discusses sampling distribution about sample mean. It defines key terms like population, sample, sampling units, stratified random sampling, systematic sampling, cluster sampling, probability sampling, non-probability sampling, estimation, estimator, estimate, and sampling distribution. It also discusses the sampling distribution of the sample mean and provides an example to calculate and compare the mean and variance of sample means for sampling with and without replacement.
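The with/without-replacement comparison mentioned above can be reproduced exhaustively for a tiny hypothetical population; this sketch enumerates every possible sample of size 2 and compares the mean and variance of the resulting sample means:

```python
from itertools import product, combinations
from statistics import mean, pvariance

population = [2, 4, 6, 8]   # hypothetical tiny population
n = 2

# All possible samples of size n, with and without replacement.
with_repl = [mean(s) for s in product(population, repeat=n)]
without_repl = [mean(s) for s in combinations(population, n)]

for label, means in (("with replacement", with_repl),
                     ("without replacement", without_repl)):
    print(f"{label}: mean of sample means = {mean(means):.2f}, "
          f"variance = {pvariance(means):.3f}")

# Both cases reproduce the population mean; the without-replacement variance
# is smaller by the finite population correction (N - n) / (N - 1).
print("population mean =", mean(population))
```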
This document discusses various complex random sampling designs, including systematic sampling, stratified sampling, cluster sampling, multi-stage sampling, sampling with probability proportional to size, and sequential sampling. It provides details on how each design is implemented and their relative advantages and disadvantages. Complex random sampling designs combine elements of probability and non-probability sampling to select samples.
This document discusses various sampling designs and their characteristics. It describes probability sampling designs like simple random sampling which gives every unit an equal chance of selection. It also describes non-probability sampling designs like purposive sampling which involves deliberately choosing units. Specific probability designs discussed include systematic sampling, stratified sampling, cluster sampling, area sampling, and multi-stage sampling.
The document discusses the differences between census surveys and sample surveys. Census surveys collect information from the entire population, while sample surveys collect information from a representative sample of the population. Census surveys are more accurate but are also more time-consuming and costly compared to sample surveys, which can be completed more quickly and at lower cost, but have some margin of error since only a sample is studied rather than the entire population.
Sampling Techniques and Sampling Methods (Sampling Types - Probability Sampli...) (Alam Nuzhathalam)
An overview of Sampling Techniques or Sampling Methods or Sampling Types (Probability Sampling: Simple Random Sampling, Stratified Random Sampling, Cluster Sampling, Systematic Random Sampling, Multi-Stage Sampling; and Non-Probability Sampling: Convenience Sampling, Quota Sampling, Judgmental Sampling, Self-Selection Sampling, Snowball Sampling), along with Sampling Errors and Non-Sampling Errors.
This document provides an introduction and overview of a presentation on hypothesis testing for a single sample test. It includes an abstract, introduction, definitions, explanations of the central limit theorem and t-test, assumptions, examples, and a question/answer section on hypothesis testing. A group of 11 students will be presenting on hypothesis testing for a single sample test, including topics like the central limit theorem, t-test, z-test, assumptions of different tests, and examples of applying the tests.
Characteristics of a good sample design & types of sample design (Dr. Sangeetha R)
The document discusses different types of sample designs, including their key characteristics and differences. It covers non-probability sampling designs like purposive sampling which rely on researcher judgement, and probability sampling designs like simple random sampling where every item has an equal chance of selection. Probability sampling is preferred because it allows estimating sampling errors and significance of results.
This document defines and explains several common measures of dispersion used in statistics including range, mean absolute deviation, variance, standard deviation, and coefficient of variation. Range is the difference between the highest and lowest values. Mean absolute deviation measures the average distance between values and the mean. Variance and standard deviation both measure how spread out numbers are by taking the average of the squared distances from the mean, with standard deviation being the square root of variance. Coefficient of variation expresses standard deviation as a percentage of the mean to allow comparison between data sets with different means.
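A short sketch of the two measures this summary adds beyond the earlier dispersion example, mean absolute deviation and the coefficient of variation, on two invented data sets with very different means:

```python
from statistics import mean, pstdev

def mean_abs_dev(xs):
    """Average absolute distance of the values from their mean."""
    m = mean(xs)
    return sum(abs(x - m) for x in xs) / len(xs)

# Two hypothetical data sets measured on very different scales.
salaries = [30000, 32000, 35000, 31000]
ages = [30, 32, 35, 31]

for name, xs in (("salaries", salaries), ("ages", ages)):
    cv = pstdev(xs) / mean(xs) * 100
    print(f"{name}: MAD = {mean_abs_dev(xs):.2f}, CV = {cv:.2f}%")

# The identical CVs show why CV, not the raw standard deviation, is used to
# compare spread across data sets with different means.
```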
Measurement & scaling ,Research methodologySONA SEBASTIAN
Measurement involves associating numbers or symbols to observations in a research study. There are different types of measurement scales including nominal, ordinal, interval, and ratio scales.
Nominal scales simply assign numbers or symbols to label elements without quantitative significance. Ordinal scales rank objects from largest to smallest but do not indicate the magnitude of differences. Interval scales assume equal units between numbers but lack a true zero point. Ratio scales have a true zero value and allow comparisons of differences between numbers through arithmetic operations.
Proper selection of measurement scales and techniques such as paired comparisons, ranking, rating, semantic differentials, and stapel scales depends on the characteristics and data type needed for the research.
This document discusses primary and secondary data. It defines primary data as data collected directly by the researcher through methods like observation, interviews, questionnaires, and surveys. Secondary data is data that has already been published through sources like books, journals, websites, and government records. The document outlines the merits and limitations of both primary and secondary data. It emphasizes the importance of evaluating secondary data for availability, relevance, accuracy, and sufficiency before using it in research.
This document discusses skewness and kurtosis in a financial context. It defines skewness as a measure of asymmetry in a distribution, with positive skewness indicating a long right tail and negative skewness a long left tail. Kurtosis is defined as a measure of the "peakedness" of a probability distribution: positive excess kurtosis indicates a sharper peak with long, fat tails (leptokurtic), while negative excess kurtosis indicates a flatter, thin-tailed distribution (platykurtic). Formulas are provided for calculating skewness and kurtosis from a data set. Examples of positively and negatively skewed distributions are given to illustrate these concepts.
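One common convention for these formulas (the moment-based, population form; the document's own versions may differ in small-sample corrections) sketched in plain Python on an invented right-skewed data set:

```python
from statistics import mean, pstdev

def skewness(xs):
    """Third standardized moment: positive => long right tail."""
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

def excess_kurtosis(xs):
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 4 for x in xs) / len(xs) - 3

right_skewed = [1, 2, 2, 3, 3, 4, 9]   # hypothetical data with a long right tail
print(f"skewness = {skewness(right_skewed):.2f}, "
      f"excess kurtosis = {excess_kurtosis(right_skewed):.2f}")
```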
Research is defined as a systematic, empirical investigation guided by theory to understand natural phenomena. It involves identifying a problem, reviewing existing literature, developing hypotheses and variables, collecting and analyzing data, and drawing conclusions. There are important components to research including the research design, methodology, instrumentation, sampling, data analysis, and conclusions. Sampling involves selecting a subset of a population to study. Probability sampling aims to give all population members an equal chance of selection, while non-probability sampling does not. Common probability sampling methods include simple random sampling, systematic sampling, stratified sampling, cluster sampling, and multistage sampling.
The document discusses sampling design and methods. It defines key terms like universe, population, sample, and stratum. There are several advantages to sampling like collecting information more quickly and at lower cost compared to a full census. Probability sampling ensures each unit has a known chance of selection, while non-probability sampling does not. Specific probability sampling methods discussed include simple random sampling, where each unit has an equal chance of selection, and stratified random sampling, where the population is divided into subgroups and samples are drawn from each.
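A minimal sketch of stratified random sampling with proportional allocation, as described above; the strata and all sizes are hypothetical:

```python
import random

random.seed(3)

# Hypothetical population divided into strata (subgroups) of unequal size.
strata = {
    "urban": [f"u{i}" for i in range(600)],
    "rural": [f"r{i}" for i in range(400)],
}
total = sum(len(units) for units in strata.values())
n = 50  # overall sample size

# Proportional allocation: each stratum contributes in proportion to its size.
sample = []
for name, units in strata.items():
    k = round(n * len(units) / total)
    sample.extend(random.sample(units, k))

print(f"stratified sample of {len(sample)}: "
      f"{sum(s.startswith('u') for s in sample)} urban, "
      f"{sum(s.startswith('r') for s in sample)} rural")
```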
The document defines a sampling distribution of sample means as a distribution of means from random samples of a population. The mean of sample means equals the population mean, and the standard deviation of sample means is smaller than the population standard deviation, equaling it divided by the square root of the sample size. As sample size increases, the distribution of sample means approaches a normal distribution according to the Central Limit Theorem.
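These three properties (the mean of sample means, the sigma-over-root-n standard deviation, and the approach to normality) can be checked by simulation; a small sketch using an invented, deliberately skewed population:

```python
import random
from statistics import mean, pstdev

random.seed(0)
# A skewed (exponential) population with mean about 20.
population = [random.expovariate(1 / 20) for _ in range(100_000)]

n = 49
sample_means = [mean(random.sample(population, n)) for _ in range(2_000)]

print(f"population: mean = {mean(population):.2f}, sd = {pstdev(population):.2f}")
print(f"sample means: mean = {mean(sample_means):.2f}, "
      f"sd = {pstdev(sample_means):.2f} "
      f"(theory sigma/sqrt(n) = {pstdev(population) / n ** 0.5:.2f})")
```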
Cluster sampling refers to a method where the population is divided into groups called clusters. A simple random sample of these clusters is selected, and then all or a subset of elements within the selected clusters are included in the final sample. It is cheaper than simple random sampling but has a higher chance of sampling error. The key aspects are that the population is divided into clusters, a random sample of clusters is taken, and then data is collected from elements within those clusters.
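A sketch of the one-stage variant described above, in which every element of each selected cluster enters the sample (in contrast to multistage sampling, which subsamples within selected units); the cluster frame is hypothetical:

```python
import random

random.seed(2)

# Hypothetical population grouped into naturally occurring clusters (schools).
clusters = {f"school_{i}": [f"pupil_{i}_{j}" for j in range(30)]
            for i in range(25)}

# Randomly select clusters, then take every element within them.
chosen = random.sample(list(clusters), k=5)
sample = [unit for c in chosen for unit in clusters[c]]

print(f"{len(sample)} pupils drawn from clusters: {chosen}")
```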
The document discusses the F-test, which is used to compare the variances of two random samples to determine if they are significantly different. It provides the formula for calculating the F-statistic, outlines the assumptions of the test, and gives two examples calculating F to test if sample variances are equal or different at the 5% significance level. In both examples, the calculated F-value is less than the critical value from the F-distribution table, so the null hypothesis of equal variances is not rejected.
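A minimal sketch of the calculation described, with invented samples; since the standard library has no F-distribution, the computed statistic is compared against a tabled critical value, just as in the document's examples:

```python
from statistics import variance

# Hypothetical samples; H0: the two population variances are equal.
a = [23, 25, 28, 30, 22, 27, 26]
b = [31, 35, 29, 33, 36, 30]

var_a, var_b = variance(a), variance(b)          # unbiased sample variances
f_stat = max(var_a, var_b) / min(var_a, var_b)   # larger variance in numerator

df_num = (len(a) if var_a >= var_b else len(b)) - 1
df_den = (len(b) if var_a >= var_b else len(a)) - 1

print(f"F = {f_stat:.3f} with ({df_num}, {df_den}) degrees of freedom")
# Look up the 5% critical value in an F-table; reject H0 (equal variances)
# only if the computed F exceeds the tabled value.
```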
The document discusses various data processing and analysis techniques including:
1. Editing of raw data to detect and correct errors through field editing by investigators and central editing by a team.
2. Coding of responses by assigning numerals or symbols to classify answers into categories for analysis.
3. Classification of data by grouping into classes based on common attributes or class intervals.
4. Tabulation by summarizing data into statistical tables for further analysis according to accepted principles.
The document provides information about the Chi-square test, including:
- It is a non-parametric test used to evaluate categorical data using contingency tables. The test statistic follows a Chi-square distribution.
- It can test for independence between variables and goodness of fit to theoretical distributions.
- Key steps involve calculating expected frequencies, squaring the differences between observed and expected frequencies, dividing each by the expected frequency, and summing the results (see the sketch after this list).
- The test interprets higher Chi-square values as less likelihood the results are due to chance. Modifications like Yates' correction and Fisher's exact test address limitations for small sample sizes.
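Here is the sketch referred to in the list above: a chi-square test of independence on an invented 2x2 contingency table, following exactly the steps named (expected frequencies, squared differences over expected, summed):

```python
# Hypothetical observed frequencies in a 2x2 contingency table.
observed = [[30, 10],
            [20, 40]]

row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]
grand = sum(row_tot)

chi_sq = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_tot[i] * col_tot[j] / grand   # expected frequency
        chi_sq += (o - e) ** 2 / e            # squared difference over expected

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(f"chi-square = {chi_sq:.2f}, df = {df}")
# Compare with the tabled critical value (3.84 at the 5% level for 1 d.f.);
# a larger statistic means the association is unlikely to be due to chance.
```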
Certified Specialist Business Intelligence (.docx) (durantheseldine)
Certified Specialist Business Intelligence (CSBI) Reflection
Part 5 of 6
CSBI Course 5: Business Intelligence and Analytical and Quantitative Skills
● Thinking about the Basics
● The Basic Elements of Experimental Design
● Sampling
● Common Mistakes in Analysis
● Opportunities and Problems to Solve
● The Low Severity Level ED (SL5P) Case Setup as an Example of BI Work
● Meaningful Analytic Structures
Analysis and Statistics
A key aspect of the work of the BI/Analytics consultant is analysis. Analysis can be defined as how the data is turned into information. Information is the outcome when the data is analyzed correctly.
Rigorous analysis offers the best chance of producing the sharpest picture of what the data might reveal, and it is the product of the proper application of statistics and experimental design.
Statistics encompasses a complex and detailed series of disciplines. Statistical concepts are foundational to all descriptive, predictive, and prescriptive analytic applications. However, the application of simple descriptive statistical calculations yields a great deal of usable information for transformational decision-making. The value of the information is amplified when using these same simple statistics within the context of a well-designed experiment.

This module is not designed to teach any one statistic. It is designed to place statistical work within the appropriate context so that it can be leveraged most effectively in driving organizational performance.
It offers an important review of the basic knowledge needed for work with descriptive and inferential statistics.
The Basic Elements of Experimental Design
Analytic tools can also provide an enhanced ability to conduct experiments. More than just allowing analysis of the output of activities or processes, experiments can be performed on processes and the output of processes. Experimenting on processes is a movement beyond the traditional r.
Planning clinical supplies has become more complex due to increased trial numbers, reduced timelines, recruitment challenges, and globalization. Forecasting and simulation tools help sponsors determine initial supply needs, optimize supply chain strategies, and ensure supplies remain sufficient. An interactive response technology system automates supply management and provides real-time data to forecasting dashboards. These dashboards allow exploring scenarios to prevent issues like stockouts and optimize efficiency. Regularly checking forecasts enables proactive management of clinical supplies.
Census, sampling survey, sampling design and types of sample design (Parvej Ahmed Porag)
The document contains information about a presentation by a group of students on various sampling topics. It includes the names and roll numbers of 12 presentation members and 3 paragraphs written by 4 of the members on the topics of census, sample, and sampling survey. It provides basic definitions and examples for each topic.
This document discusses methods of data collection through census and sampling. It explains that census involves collecting data from all units in the population, while sampling collects data from a subset of representative units and uses it to make inferences about the overall population. It then outlines key merits and demerits of each approach, as well as common sampling methods like simple random sampling and stratified random sampling.
Experimental research designs are considered the standard for research. They involve assigning subjects to different treatments and observing the effects on one or more dependent variables in order to draw conclusions. Experimental research has both advantages and disadvantages: it allows full researcher control but can be resource-intensive. It aims to determine relationships between dependent and independent variables by supporting or rejecting hypotheses. Data must be quantifiable and include measurements of variables like area, weight, and temperature; qualitative observations also supplement the research. Overall, experimental research uses a scientific approach to test business matters and understand customer behavior through product testing and experiments.
This document provides information on sampling techniques. It defines sampling as using a subset of a larger population to make inferences about the whole population. The purpose of sampling is to provide statistical information about the whole population while examining only a selected portion to reduce costs and increase efficiency and reliability compared to a census. The document outlines the steps in the sampling process, including defining the population, identifying a sampling frame, determining the sample size, and selecting the sample. It provides an example of calculating sample size to estimate average weekly internet usage.
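The sample-size example mentioned (estimating average weekly internet usage) is not reproduced in the summary, so the sketch below applies the standard n = (z * sigma / E)^2 formula with placeholder values of my own:

```python
import math
from statistics import NormalDist

# Hypothetical inputs: assumed population sd, desired margin of error.
sigma = 6.0        # assumed sd of weekly internet hours
E = 1.0            # tolerable error of the estimate (hours)
confidence = 0.95

z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # 1.96 for 95%
n = (z * sigma / E) ** 2

print(f"required sample size: {math.ceil(n)}")      # always round up
```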
Demand forecasting can be done using two approaches: obtaining information from experts or consumers, or using past sales data through statistical techniques. Expert surveys include opinion polls and the Delphi technique. Consumer surveys can be a complete enumeration or a sample survey. Complex statistical methods include time series analysis, correlation/regression analysis, and simultaneous equation models. Demand forecasting helps with production, financial, and workforce planning as well as decision making.
This document discusses research design in marketing research. It defines research design as a framework that details the procedures for obtaining needed information to solve research problems. The document outlines exploratory and conclusive research and their differences. It also discusses descriptive, causal, cross-sectional, and longitudinal research designs. Various sources of error in research designs are presented.
The document provides details about the life insurance industry in India including its history dating back to 1818, nationalization in 1956, and the opening up of the private sector in 2000. It discusses the key milestones in the evolution of life insurance regulation and the current state of competition in the rapidly changing industry. The challenges of customer prospecting in life insurance are explored in the context of increasing competition in the sector.
This document discusses audit sampling, including:
1. The definition and purpose of audit sampling, which is using procedures on less than 100% of items to make inferences about the whole population.
2. Factors that affect sample size such as population size, confidence level, precision, risk, and materiality.
3. Types of sampling methods like simple random sampling, stratified sampling, and cluster sampling.
4. The differences between tests of control and substantive tests, and between statistical and non-statistical sampling.
5. Key concepts like type I and type II errors, tolerable error, and expected error in the population.
This document provides guidance on using direct observation techniques to evaluate development programs. It discusses advantages, such as observing programs in their natural setting, and potential limitations, such as observer bias, and it lays out the steps for conducting effective direct observations: determining the focus of observation, developing observation forms, selecting sites, deciding on timing, conducting observations, completing forms, and analyzing the data. Direct observation is recommended when performance is not meeting plans, when implementation problems exist but are not understood, or when process details need to be assessed.
The document appears to be a student report on understanding the challenges of customer prospecting in the life insurance industry. It includes sections on introduction, objectives, methodology, industry details, data analysis, secondary research on companies and competitors, and conclusions and recommendations. The report was submitted by Sachin B. Bone to complete his MBA curriculum requirements.
It is the process of selecting the sample for estimating the population characteristics. In other words, it is the process of obtaining information about an entire population by examining only a part of it.
This document discusses the concept and methodology of work sampling. It begins by defining work sampling as a statistical technique used to determine the proportion of time employees spend on different work activities. Work sampling involves taking many random observations over time to approximate how time is spent. It has advantages over other work measurement methods like being less disruptive and requiring less expertise. However, it also has limitations like not accounting for work pace. The document then outlines the typical procedure for conducting a work sampling study, including defining the problem, getting approvals, designing observation forms, determining observation frequency, and evaluating methods to reduce bias. Formulas are provided to determine required sample sizes for desired accuracy levels.
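The sample-size formulas the summary mentions are presumably the standard work-sampling one, n = z^2 * p * (1 - p) / e^2; a sketch with placeholder values (not taken from the document):

```python
import math
from statistics import NormalDist

# Hypothetical inputs for a work sampling study.
p = 0.30        # preliminary estimate of the proportion of time on an activity
e = 0.03        # desired absolute accuracy (plus or minus 3 percentage points)
confidence = 0.95

z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
n = z ** 2 * p * (1 - p) / e ** 2

print(f"required random observations: {math.ceil(n)}")
```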
The document discusses different types of sampling designs used in research. It describes probability sampling methods like simple random sampling and systematic sampling which allow every unit in the population to have a chance of being selected. It also covers non-probability sampling which does not assure equal chance of selection. Key factors in sampling like sample size, target population, and parameters of interest are explained.
The document provides guidance on designing effective questionnaires. It emphasizes that questionnaires must have well-defined objectives in order to ask relevant questions and draw meaningful conclusions from the responses. Questions should follow logically from clear objectives. It also stresses that both open-ended and closed-format questions each have advantages, and the type of questions used should depend on the specific information needed. Demographic questions can help analyze response patterns among different groups. Overall, carefully considering objectives, question types, and question wording is essential for creating a questionnaire that efficiently gathers high-quality data.
The document summarizes a retail audit conducted on Lifebuoy, Lux, and Breeze soaps in Orai, Jalaun, India. The audit found that Lux and Lifebuoy were available in all 50 retail outlets surveyed, while Breeze was only available in 8 outlets. Most outlets displayed the HUL soaps at eye-level. Godrej No. 1 and Vivel were identified as the main competitors by retailers. On average, retailers sold over 40 dozen units of Lux and 18 dozen units of Lifebuoy per month. Most retailers were satisfied with the timely distribution of HUL products and their advertising effectiveness.
This document discusses different sampling methods used in research. It defines sampling as selecting a portion of a population to make generalizations about the whole population. There are two main types of sampling: probability sampling, where every member of the population has a known chance of being selected, and non-probability sampling, where not every member has a chance of selection. Some common probability sampling methods described are simple random sampling, systematic sampling, and stratified sampling. Non-probability sampling methods include convenience sampling and purposive sampling. The document provides examples and steps for implementing different sampling designs.
This document discusses sampling techniques used in research. It begins by defining key terms like population, census, sample, and sampling unit. It then outlines the 7 steps in the sampling process: 1) define the population, 2) identify the sampling frame, 3) specify the sampling unit, 4) specify the sampling method, 5) determine the sample size, 6) specify the sampling plan, and 7) select the sample. The document also discusses advantages and limitations of sampling, and describes probability and non-probability sampling designs. Probability sampling aims to give all units an equal chance of selection to help ensure results are representative of the overall population.
INDEX

Sampling and Sampling Distribution

1. Sampling
   1.1 What is Sampling?
   1.2 Why Sampling instead of Census?
   1.3 Sampling Methods
       1.3.1 Probability Sampling Methods
             1.3.1.1 Simple Random Sampling
             1.3.1.2 Systematic Sampling
             1.3.1.3 Stratified Sampling
             1.3.1.4 Cluster Sampling
       1.3.2 Non-Probability Sampling Methods
             1.3.2.1 Convenience Sampling
             1.3.2.2 Purposive Sampling
                     1.3.2.2.1 Judgment Sampling
                     1.3.2.2.2 Quota Sampling
2. Sampling Distribution
   2.1 Sampling Distribution of the Mean
   2.2 The Central Limit Theorem
   2.3 Sampling Distribution of the Variance
   2.4 The Chi-square Distribution
   2.5 Sampling Distribution of the Proportion
   2.6 The Confidence Level
3. Bibliography
***
1. Sampling
When managers use research, they are applying the methods of science to the art of management. Business operates in a world of uncertainty, and there is no unique method which can entirely eliminate this uncertainty. Nevertheless, research methodology can indeed minimise the extent of uncertainty and reduce the probability of making a wrong choice amongst alternative courses of action. Therefore, the increasingly complex nature of business and governance focusses more and more attention on the use of research methodology in solving managerial problems. In the prevailing, highly involved environment, neither a business decision nor a governmental decision can be made casually or based on intuition.
It is through appropriate data and their analysis that the decision maker becomes equipped with proper tools of decision making. Needless to say, the credibility of the results derived from the application of such methodology is dependent upon the reliability of the data included in the analysis.
The quantitative tool of inferential statistics is extensively used to address managerial and business problems by using the relevant data. Inferential statistics are the quantitative tools that use samples to estimate population parameters, beyond what could arise merely by chance. Good research is only as good as the design, methods, and statistics used. Yet the design, methods, and statistics are useless if, first, an optimal sample is not used. Thus sampling is the cornerstone of any business research.
1.1 What is Sampling?
The terminology "sampling" indicates the selection of a part of a group or an aggregate with a view to obtaining information about the whole (Figure 1: Research Methodology; a sample is drawn from the population by sampling, and the sample statistic is used, through estimation and inference, to learn about the population parameter). This aggregate or the
totality of all members is known as the Population, although the members need not be human beings. The selected part, which is used to ascertain the characteristics of the population, is called the Sample. While choosing a sample, the population is assumed to be composed of individual units or members, some of which are included in the sample. The total number of members of the population and the number included in the sample are called the Population Size and the Sample Size respectively. The concept can be shown through the following Venn diagram, where the population is a universal set and the sample is shown as a true subset.
Population: Set of all items
Sample: Set of chosen items
(Figure 2: Population & Sample)
The process of generalising on the basis of information collected on a part is really a
traditional practice. With the advancement of management science more sophisticated
applications of sampling in business and industry are available. Sampling methodology can
be used by an auditor or an accountant to estimate the value of total inventory in the stores
without actually inspecting all the items physically. Opinion polls based on samples are used
to forecast the result of a forthcoming election.
1.2 Why Sampling instead of Census?
The census or complete enumeration consists of collecting data from each and every unit of the population. Sampling, in contrast, chooses only a part of the units from the population for the same study. Sampling has a number of advantages over complete enumeration, for a variety of reasons.
Cost
The first obvious advantage of sampling is that it is less expensive. If we want to study consumer reaction before launching a new product, it will be much less expensive to carry out a consumer survey based on a sample rather than studying the entire population, which is the potential group of customers. Although in a decennial census every individual is enumerated, certain aspects of the population are studied on a sample basis with a view to reducing cost.
Time
The smaller size of the sample enables us to collect the data more quickly than surveying all the units of the population, even if we are willing to spend money. This is particularly the case if the decision is time-bound. An accountant may want to know the total inventory value quickly in order to prepare a periodical report like a quarterly balance sheet and a profit and loss account; a detailed study of the inventory is likely to take too long to enable him to prepare the report in time. If we want to measure the Consumer Price Index in a particular month, we cannot collect data on all consumer prices even if the expenditure is not a hindrance. The collection of data on all the consumer items and their processing would in all probability take a long time, and thus, when ready, the price index would not serve any meaningful purpose.
Accuracy
It is possible to achieve greater accuracy by using appropriate sampling techniques than by a complete enumeration of all the units of the population. Contrary to common belief, complete enumeration may result in inaccuracies in the data owing to the fatigue of the enumerator, or to spurious and unreliable data collected because of the large volume. On the other hand, if a small number of items is observed, the basic data will be much more accurate. It is of course true that a conclusion about a population characteristic, such as the proportion of defective items, drawn from a sample will also introduce error into the system. However, such errors, known as sampling errors, can be studied and controlled, and probability statements can be made about their magnitude. The inaccuracy which results, for instance, from the fatigue of the inspector is known as non-sampling error. It is difficult to recognise the pattern of non-sampling error, and it is not possible to make any comment about its magnitude, even probabilistically.
Reliability of Inference
In many cases, sampling provides adequate information so that not much additional
reliability can be gained with complete enumeration in spite of spending large amounts of
additional money and time. It is also possible to quantify the magnitude of the possible error when using some types of sampling, which is not the case with the census approach.
Impossibility of complete enumeration
In many situations the item being studied gets destroyed while being tested, and sampling is indispensable under such circumstances. If one is interested in computing the average life of Compact Fluorescent Lamps (CFLs) supplied in a batch, the life of the entire batch cannot be examined, since this would mean that the entire supply is wasted. In such cases there is no alternative but to examine the life of a sample of CFLs and draw an inference about the entire batch.
Infeasibility of complete enumeration
More often than not, it is practically infeasible to do a complete enumeration due to many practical difficulties. For example, suppose a shaving gel manufacturer wants to launch a new and improved version of its gel. To get consumer feedback, the manufacturer distributes the old version of the gel to, say, 500 consumers and after a week or so replaces it with the new version to get feedback on various attributes of the product. In this situation, it would be infeasible to collect information from all the consumers of shaving gel in India. Some consumers would have moved from one place to another during the period of study, some others would have stopped consuming shaving gel just before the period of study, whereas some others would have been users of shaving gel during the period of study but would have stopped using it some time later. In such situations, although it is theoretically possible to do a complete enumeration, it is practically infeasible to do so.
The above account clearly establishes that a research study gives more reliable results, at greater convenience, by way of sampling than by a study of the entire population.
1.3 Sampling Methods
A sampling frame is a list of all the units of the population. The sampling frame should always be kept up to date and be free from errors of omission and duplication of sampling units. A perfect frame identifies each element once and only once. Perfect frames are seldom available in real life. Nevertheless, it needs to be ensured that the sampling frame is complete, accurate, adequate and up-to-date.
Further, depending on the requirements of the research, sampling methods are broadly categorised into two groups, viz. probability sampling methods and non-probability sampling methods, as depicted in Figure 3.
1.3.1 Probability Sampling Methods
In probability sampling methods, the population from which the sample is drawn must be known to the researcher, and every item of the population has a known, non-zero chance of inclusion in the sample (an equal chance, in the case of simple random sampling). The lottery method, in which a student's name is drawn blindfold from a box containing the names of all students, is the classic example of random sampling; it is an unbiased technique and a sound process for selecting a representative sample. The major disadvantage is that this technique needs the complete sampling frame, i.e. the list of all the items in the population, which is not always available.
The probability sampling methods are of four types, viz. Simple Random Sampling, Systematic Sampling, Stratified Sampling and Cluster Sampling.
(Figure 3: Sampling Methods. Probability methods: Simple Random, Systematic, Stratified, Cluster. Non-probability methods: Convenience, Purposive, Quota, Judgment.)
1.3.1.1 Simple Random Sampling
Simple random sampling is based on the concept of probability. The use of probability in
sampling theory makes it a reliable tool to draw inference or conclusion about the
population. Although the types of conclusion or inference can be quite diverse, two
particular types of decision making are quite prevalent in problems of business and
government.
On various occasions, the management would like to know the percentage or proportion of units in the population with a certain characteristic. An organisation selling a consumer product may like to know the proportion of potential consumers using a certain type of cosmetic. The government may like to know the percentage of small farmers owning some cultivable land in a rural region. A manufacturer planning to export some product may be interested to ascertain the proportion of defect-free units his system is capable of manufacturing.
The representative character of a sample is ensured by allocating some probability to each
unit of the population for being included in the sample. The simple random sample assigns
equal probability to each unit of the population. The simple random sample can be chosen
both with and without replacement.
Simple Random Sampling with Replacement
Suppose the population consists of N units and we want to select a sample of size n. In simple random sampling with replacement, we choose an observation from the population in such a manner that every unit of the population has an equal chance of 1/N of being included in the sample. After the first unit is selected, its value is recorded and it is placed back in the population. The second unit is drawn in exactly the same manner as the first unit. This procedure is continued until the nth unit of the sample is selected. Obviously, in this case each unit of the population has an equal chance of 1/N of being included in each of the n draws of the sample.
Simple Random Sampling without Replacement
In this case, when the first unit is chosen, every unit of the population has a chance of 1/N of being included in the sample. After the first unit is chosen, it is no longer replaced in the population. The second unit is selected from the remaining (N-1) members of the population, so that each unit has a chance of $\frac{1}{N-1}$ of being included in the sample. The procedure is continued till the nth unit of the sample is chosen, with probability $\frac{1}{N-n+1}$.
Random numbers for simple random sampling are generated using a probabilistic mechanism.
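As a minimal illustration of the two schemes, the following Python sketch (the frame of 100 labelled units and the seed are illustrative assumptions) draws one sample of size n = 10 each way, using only the standard library:

```python
import random

population = list(range(1, 101))   # hypothetical sampling frame: N = 100 labelled units
n = 10
random.seed(42)                    # fixed seed, only so the run is reproducible

# Without replacement: a selected unit is not returned, so it appears at most once.
srs_without = random.sample(population, n)

# With replacement: each draw is made from the full population of N units,
# so every draw gives every unit the same chance of 1/N.
srs_with = random.choices(population, k=n)

print(sorted(srs_without))
print(sorted(srs_with))            # may contain repeats
```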
1.3.1.2 Systematic Sampling
Systematic sampling involves selecting items using a constant interval between the selections, determined by the sampling ratio, with the first interval having a random start. For example, if a sample of size 10 from a population of size 100 is required, the sampling ratio would be n/N = 10/100 = 1/10. It would, therefore, have to be decided where to start from among the first 10 names in our sampling frame. If this number happens to be 5, for example, then the sample would contain the members having serial numbers 5, 15, 25, 35, …, 95 in the frame. It is noteworthy that the random process establishes only the first member of the sample; the rest are pre-determined by the known sampling ratio. Usually the starting serial number of the sample is decided by allowing chance to play its role, using a table of random numbers. In other words, the sampling starts by selecting an element from the list at random, and then every kth element in the frame is selected, where k, the sampling interval (sometimes known as the skip), is calculated as $k = \frac{N}{n}$, where n is the sample size and N is the population size.
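A minimal sketch of this procedure in Python (N = 100 and n = 10 are the illustrative figures used above):

```python
import random

N, n = 100, 10
k = N // n                             # sampling interval, the "skip" (ratio 1/10 -> k = 10)
random.seed(1)
start = random.randint(1, k)           # chance decides the start among the first k serial numbers
sample = list(range(start, N + 1, k))  # start, start + k, start + 2k, ...
print(sample)                          # e.g. serial numbers 5, 15, 25, ..., 95 if start == 5
```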
Systematic sampling is relatively much easier to implement compared to simple random
sampling. However, there is one possibility that should be guarded against while using
systematic sampling - the possibility of a strong bias in the results if there is any periodicity
in the frame that parallels the sampling ratio. For example if someone were making studies
on the demand for various banking transactions in a bank branch by studying the demand
on some days randomly selected by systematic sampling and the chosen sampling ratio is
1/7 or 1/14 etc, he would always be studying the demand on the same day of the week and
the inferences could be biased depending on whether the day selected is a Monday or a
Friday and so on.
If the frame is arranged in an order, ascending or descending, of some attribute then the
location of the first sample element may affect the result of the study. For example, if the
frame contains a list of students arranged in a descending order of their percentage in the
previous examination and we are picking a systematic sample with a sampling ratio of 1/50.
If the first number picked is 1 or 2, then the sample chosen will be academically much better
off compared to another systematic sample with the first number chosen as 49 or 50. In
such situations, one should devise ways of nullifying the effect of bias due to starting number
by insisting on multiple starts after a small cycle or other such means.
On the other hand, if the frame is so arranged that similar elements are grouped together,
then systematic sampling produces almost a proportional stratified sample and would be,
therefore, more statistically efficient than simple random sampling.
Systematic sampling is perhaps the most commonly used method among the probability
sampling designs and for many purposes e.g. for estimating the precision of the results,
systematic samples are treated as simple random samples.
1.3.1.3 Stratified Sampling
Simple random sampling may not always provide a representative snapshot of the population. Certain segments of a population can easily be under-represented when an unrestricted random sample is chosen. Hence, when considerable heterogeneity is present in the population with regard to the subject matter under study, it is often a good idea to divide the population into segments or strata and select a certain number of sampling units from each stratum, thus ensuring representation from all relevant segments (Figure 5: Stratified Sampling, in which the population is divided into strata of sizes $N_1, N_2, \dots, N_p$ and samples of sizes $n_1, n_2, \dots, n_p$ are drawn from them). Thus, for designing a suitable marketing strategy for a consumer durable, the population of consumers may be divided into strata by income level and a certain number of consumers can be selected randomly from each stratum.
Therefore, in stratified random sampling the population is first divided into different homogeneous groups or strata, which may be based upon a single criterion, such as sex, or upon a combination of criteria, like sex, caste, level of education and so on. This method is generally applied when different categories of individuals constitute the population, viz. General, OBC, SC, ST; or upper income, middle income, lower income; or small farmers, big farmers, marginal farmers, landless farmers, etc. To obtain a true picture of a particular population regarding, say, the standard of living, it is advisable in the case of India to categorise the population on the basis of caste, religion or land holding; otherwise some sections may be under-represented or not represented at all.
Stratified random sampling may be either Proportionate Stratified Random Sampling or Disproportionate Stratified Random Sampling.
Proportionate Stratified Random Sampling
In the proportionate stratified random sampling method, the researcher stratifies the population according to known characteristics and subsequently draws the sample randomly from each stratum in proportion to the stratum's share of the population. That is, the population is divided into several sub-populations, called strata, depending upon certain known characteristics, and each stratum is internally homogeneous. For example, suppose a town area committee consists of 15,000 voters, among whom 60% are Hindus, 30% are Muslims and 10% are others, and the researcher wants to draw a sample of 300 voters from the population in these proportions. That can be done by multiplying the sample size by each proportion: the sample size for Hindu voters will be 300 x 60% = 180, for Muslims 300 x 30% = 90, and for others 300 x 10% = 30. The researcher then has to obtain the complete voter list of the town and randomly select the sample from each category as calculated above. In this method the sampling error is minimised and the sample possesses all the required characteristics of the population.
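The voter example can be sketched in Python as follows (the stand-in voter lists are assumptions; a real study would use the actual voter list):

```python
import random

n_total = 300
strata = {"Hindu": 9000, "Muslim": 4500, "Others": 1500}  # 60% / 30% / 10% of 15,000 voters
N = sum(strata.values())

random.seed(7)
sample = {}
for name, size in strata.items():
    n_h = round(n_total * size / N)                # proportional allocation: 180 / 90 / 30
    voters = [f"{name}-{i}" for i in range(size)]  # stand-in for the real voter list
    sample[name] = random.sample(voters, n_h)      # simple random sample within the stratum

print({name: len(chosen) for name, chosen in sample.items()})
# -> {'Hindu': 180, 'Muslim': 90, 'Others': 30}
```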
Disproportionate Stratified Random Sampling
In this method the number of sampling units taken from each stratum need not be in proportion to the stratum's share of the population. Suppose that for the said town the researcher wants to know the voting pattern of male and female voters among Hindu, Muslim and other voters; in that case he must take equal numbers of male and female voters from each category, giving equal weightage to each stratum. This is a biased type of sampling: some strata are over-represented and some under-represented, so the sample is not truly representative; still, it is useful in some special cases.
If the different strata in the population have unequal variances of the characteristic being
measured, then the sample size allocation decision should consider the variance as well. It
would be logical to have a smaller sample from a stratum where the variance is smaller than
from another stratum where the variance is higher. In fact, if $\sigma_1^2, \sigma_2^2, \dots, \sigma_p^2$ are the variances of the $p$ strata, then the statistical efficiency is highest when

$$\frac{n_1}{N_1 \sigma_1} = \frac{n_2}{N_2 \sigma_2} = \dots = \frac{n_p}{N_p \sigma_p}$$
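A small numerical sketch of this allocation rule (the strata sizes and within-stratum standard deviations are assumed figures): setting $n_i$ proportional to $N_i \sigma_i$ makes the ratios above equal.

```python
N_sizes = [9000, 4500, 1500]   # assumed stratum sizes N_1, N_2, N_3
sigmas  = [2.0, 5.0, 10.0]     # assumed within-stratum standard deviations
n_total = 300

# n_i proportional to N_i * sigma_i keeps n_i / (N_i * sigma_i) constant across strata.
weights = [N * s for N, s in zip(N_sizes, sigmas)]
allocation = [round(n_total * w / sum(weights)) for w in weights]
print(allocation)  # [97, 122, 81]: high-variance strata get more than proportional shares
```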
1.3.1.4 Cluster Sampling
This is another type of probability sampling method, in which the sampling units are not individual elements of the population; instead, groups of elements, or groups of individuals, are selected as the sample. In cluster sampling the total population is divided into a number of relatively small sub-divisions or groups, which are themselves clusters, and then some of these clusters are randomly selected for inclusion in the sample. Suppose a researcher wants to study the functioning of the mid-day meal service in a district; he can use the schools clustered in a block or two rather than selecting schools scattered all over the district. Cluster sampling reduces the cost and labour of data collection but is less precise than random sampling.
We can now compare cluster sampling with stratified sampling. Stratification is done to make the strata homogeneous within and different from other strata. Clusters, on the other hand, should be heterogeneous within, and the different clusters should be similar to each other. A cluster, ideally, is a mini-population and has all the features of the population.
The criterion used for stratification is a variable which is closely associated with the characteristic we are measuring, e.g. income level when we are measuring the family consumption of non-aerated beverages. On the other hand, convenience of data collection is usually the basis for cluster definitions. Geographic contiguity is quite often used for cluster definitions, and in such cases cluster sampling is also known as Area Sampling.
In stratified sampling there are relatively few strata, and one picks a random sample from each stratum for drawing inferences. In cluster sampling, there are many clusters, out of which only a few are picked by random sampling, and then those clusters are completely enumerated.
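A minimal Python sketch of this contrast (the frame of 40 blocks with 25 schools each is an assumption): a few clusters are drawn at random and then completely enumerated.

```python
import random

# Assumed frame: 40 blocks (clusters), each containing 25 schools.
clusters = {block: [f"school-{block}-{i}" for i in range(25)] for block in range(40)}

random.seed(3)
chosen_blocks = random.sample(list(clusters), 4)          # randomly pick a few clusters...
sample = [s for b in chosen_blocks for s in clusters[b]]  # ...and enumerate them completely
print(len(sample))                                        # 4 clusters x 25 schools = 100
```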
Multi-stage and Multi-phase Sampling
In this method sampling is drawn more than once. It is used in most large surveys, where the sampling unit is something larger than an individual element of the population in all stages but the final one. For example, in a national survey on the demand for fertilizers, one might use stratified sampling in the first stage with the district as the sampling unit and the average rainfall in the district as the criterion for stratification. Having obtained 20 districts from this stage, cluster sampling may be used in the second stage to pick 10 villages in each of the selected districts. Finally, in the third stage, stratified sampling may be used in each village to pick farms in each of the strata defined with land holding as the criterion.
Multi-phase sampling, on the other hand, is designed to make use of the information collected in
one phase to develop a sampling design in a subsequent phase. A study with two phases is often
called Double Sampling. The first phase of the study might reveal a relationship between the family
consumption of non-aerated beverages and the family income and this information would then be
used in the second phase to stratify the population with family income as the criterion.
1.3.2 Non-Probability Sampling Methods
Probability sampling has some theoretical advantages over non-probability sampling: the bias introduced by sampling can be eliminated, and it is possible to set a confidence interval for the population parameter being studied. In spite of these advantages, non-probability sampling is used quite frequently in many sampling surveys, for reasons that are all practical.
Probability sampling requires a list of all the sampling units, and this frame is not available in many situations; nor is it practically feasible to develop a frame of, say, all the households in a city or a zone or ward of a city. Sometimes the objective of the study may not be to draw a statistical inference about the population but to get familiar with extreme cases or other such objectives. In a dealer survey, our objective may be to become familiar with the problems faced by our dealers so that we can take some corrective actions wherever possible. Probability sampling is rigorous, and this rigour, e.g. in selecting samples, adds to the cost of the study. And finally, even when we are doing probability sampling, there are chances of deviations from the laid-out process, especially where some samples are selected by the interviewers on site, say after reaching a village. Also, some of the sample members may not agree to be interviewed, or may not be available to be interviewed, and our sample may turn out to be a non-probability sample in the strictest sense of the term.
1.3.2.1 Convenience Sampling
In this type of non-probability sampling, the choice of the sample is left completely to the convenience of the researcher. The cost involved in picking the sample is minimal and the cost of data collection is also generally low; e.g. the researcher can go to some retail shops and interview some shoppers while studying the demand for some commodity.
Another form of convenience sampling is known as 'Snowball Sampling'. This is a sociometric sampling technique generally used to study small groups. All the persons in a group identify their friends, who in turn know their friends and colleagues, until the informal relationships converge into some type of definite social pattern. It is like a snowball increasing in size as it rolls down an ice-field. For example, in research on drug addiction it is difficult to find out who the drug users are, but when one person is identified he can give the names of his partners, and each of his partners can give another two or three names of people whom he knows to use drugs. In this way the required number of persons is identified and data are collected. This method is suitable for studies of the diffusion of innovation, network analysis and decision making.
However, such samples can suffer from excessive bias from known or unknown sources and also
there is no way that the possible errors can be quantified.
1.3.2.2 Purposive Sampling
In convenience sampling, any member of the population can be included in the sample without any restriction. When some restrictions are put on the possible inclusion of a member in the sample, the sampling is called purposive. This is a non-random sampling method in which the researcher selects the sample arbitrarily, choosing elements he considers important for the research and believes to be typical and representative of the population. Say a researcher wants to forecast a political party's chance of coming to power in a general election. He may select some reporters, some teachers and some elite people of the territory and collect their opinions for the purpose of his study, considering these to be the leading persons whose views are relevant to the party's chances. As it is a purposive method, it can have large sampling errors and can lead to misleading conclusions.
The purposive sampling is broadly of two types, viz. Judgment Sampling and Quota Sampling.
1.3.2.2.1 Judgment Sampling
In judgment sampling, the judgment or opinion of some experts forms the basis for sample selection.
The experts are persons who are believed to have information on the population which can help in
giving us better samples. Such sampling is very useful when we want to study rare events, or when
members have extreme positions, or even when the objective of the study is to collect a wide cross-
section of views from one extreme to the other.
1.3.2.2.2 Quota Sampling
Even while using non-probability sampling, one might want the sample to be representative of the population in some defined ways. This is sought to be achieved in quota sampling, so that the bias introduced by sampling may be reduced.
If in a given population, 25% of the members belong to the high income group, 25% to the middle
income group, 35% to the low income group and 15 % are Below Poverty Line (BPL) and we are using
quota sampling, we would specify that the sample should also contain members in the same
proportion as in the population e.g. 15% of the sample members would belong to the BPL group
and so on.
The criteria used to set quotas could be many. For example, family size could be another criterion
and we can set quotas for families with family size upto 3, between 4 & 5, and above 5. However, if
the number of such criteria is large, it becomes difficult to locate sample members satisfying the
combination of the criteria. In such cases, the overall relative frequency of each criterion in the
sample is matched with the overall relative frequency of the criterion in the population.
This method of sampling is almost the same as the stratified random sampling described above; the only difference is that in selecting the elements randomisation is not done, and the quota is taken into consideration instead. As quota sampling is not random, the method is biased and can lead to large sampling errors.
2. The Sampling Distribution
Sample statistics form the basis of all inferences drawn about populations. If we know the probability
distribution of the sample statistic, then we can calculate the probability that the sample statistic
assumes a particular value (if it is a discrete random variable) or has a value in a given interval. This
ability to calculate the probability that the sample statistic lies in a particular interval is the most
important factor in all statistical inferences. Let’s demonstrate this by an example.
Suppose we know that 55% of the population of all users of Shampoo prefer brand ‘A’ to the next
competing brand. A “new improved” version of ‘A’ has been developed and given to a random
sample of 200 shampoo users for use. If 120 of these prefer the “new improved” version to the next
competing brand, what should one conclude? For an answer, we would like to know the probability
that the sample proportion in a sample of size 200 is as large as 60% or higher when the true
population proportion is only 55%, i.e. assuming that the new version is no better than the old. If
this probability is quite large, say 0.5, we might conclude that the high sample proportion viz. 60% is
perhaps because of sampling errors and the new version is not really superior to the old. On the
other hand, if this probability works out to a very small figure, say 0.001, then we might conclude
that the true population proportion is higher than 55%, i.e. the new version is actually superior to
the old one as perceived by members of the population. To calculate this probability, we need to
know the probability distribution of sample proportion or the sampling distribution of the
proportion.
The sampling distribution, thus, is a distribution of a sample statistic. It is a model of a distribution
of scores, like the population distribution, except that the scores are not raw scores, but statistics. It
is a thought experiment; "what would the world be like if a person repeatedly took samples of size
N from the population distribution and computed a particular statistic each time?" The resulting
distribution of statistics is called the sampling distribution of that statistic.
For example, suppose that a sample of size sixteen (N=16) is taken from some population. The mean
of the sixteen numbers is computed. Next a new sample of sixteen is taken, and the mean is again
computed. If this process were repeated an infinite number of times, the distribution of the now
infinite number of sample means would be called the sampling distribution of the mean. Similarly,
every statistic has a sampling distribution.
Just as population models can be described with parameters, so can the sampling distribution. The expected value (analogous to the mean) of a sampling distribution will be represented here by the symbol $\mu$. The $\mu$ symbol is often written with a subscript to indicate which sampling distribution is being discussed. For example, the expected value of the sampling distribution of the mean is represented by the symbol $\mu_{\bar{x}}$, that of the median by $\mu_{Md}$, etc. The value of $\mu_{\bar{x}}$ can be thought of as the mean of the distribution of means. In a similar manner, the value of $\mu_{Md}$ is the mean of a distribution of medians. They are not really means, because it is not possible to find a mean when $N = \infty$, but they are the mathematical equivalent of a mean.
Using advanced mathematics, in a thought experiment, the theoretical statistician often discovers a relationship between the expected value of a statistic and the model parameters. For example, it can be proven that the expected value of both the mean and the median, $\bar{X}$ and $Md$, is equal to $\mu_x$. When the expected value of a statistic equals a population parameter, the statistic is called an unbiased estimator of that parameter. In this case, both the mean and the median would be unbiased estimators of the parameter $\mu_x$.
A sampling distribution may also be described with a parameter corresponding to a variance, symbolized by $\sigma^2$. The square root of this parameter is given a special name, the standard error. Each sampling distribution has a standard error. In order to keep them straight, each has a name tagged on the end of the phrase "standard error" and a subscript on the $\sigma$ symbol. The standard deviation of the sampling distribution of the mean is called the standard error of the mean and is symbolized by $\sigma_{\bar{x}}$. Similarly, the standard deviation of the sampling distribution of the median is called the standard error of the median and is symbolized by $\sigma_{Md}$.
In each case the standard error of a statistic describes the degree to which the computed statistics will differ from one another when calculated from samples of similar size selected from similar population models. The larger the standard error, the greater the difference between the computed statistics. Consistency is a valuable property to have in the estimation of a population parameter: the statistic with the smallest standard error is preferred as the estimator of the corresponding population parameter, everything else being equal. Statisticians have proven that in most cases the standard error of the mean is smaller than the standard error of the median. Because of this property, the mean is the preferred estimator of $\mu_x$.
In practice, we refer to the sampling distributions of only the commonly used sampling statistics like
the sample mean, sample variance, sample proportion, sample median etc., which have a role in
making inferences about the population.
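The thought experiment above can be carried out numerically. The following Python sketch (the population parameters, sample size and number of repetitions are illustrative assumptions) repeatedly draws samples of size 16 from a normal population and compares the standard errors of the mean and the median:

```python
import random
import statistics

random.seed(0)
mu, sigma, n, reps = 10.0, 2.0, 16, 20000

means, medians = [], []
for _ in range(reps):                         # "repeatedly take samples of size N..."
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(statistics.mean(sample))     # "...and compute a statistic each time"
    medians.append(statistics.median(sample))

# Both statistics centre on mu = 10, but the mean has the smaller standard error.
print(statistics.mean(means), statistics.stdev(means))      # ~10.0 and ~sigma/sqrt(n) = 0.5
print(statistics.mean(medians), statistics.stdev(medians))  # ~10.0 and a visibly larger value
```

For a normal population, the standard error of the median comes out roughly 25% larger than that of the mean, which is why the mean is preferred as the estimator of $\mu_x$.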
2.1 The Sampling Distribution of the Mean
There are many (infinitely many!) possible values of the sample mean, and the particular value that we obtain, if we pick only one sample, is determined only by chance. The distribution of the sample mean is referred to as the sampling distribution of the mean. However, to observe the distribution of $\bar{x}$ empirically, we have to take many samples of size n and determine the value of $\bar{x}$ for each sample. Then, looking at the various observed values of $\bar{x}$, it might be possible to get an idea of the nature of the distribution. Such a sampling distribution of the mean is known as the distribution of sample means. This distribution is described by the parameters $\mu_{\bar{x}}$ and $\sigma_{\bar{x}}$.
Sampling from Infinite Populations
Let’s study two cases –
1. Where the population is infinitely large or when the sampling is done with replacement
2. Where the population is finite and we are sampling without replacement
For the first scenario, let us assume we have a population which is infinitely large, with a population mean of $\mu$ and a population variance of $\sigma^2$. This implies that if x is a random variable denoting the measurement of the characteristic that we are interested in, on one element of the population picked up at random, then the expected value of x is $E(x) = \mu$ and the variance of x is $Var(x) = \sigma^2$.
The sample mean, $\bar{x}$, can be looked at as the sum of the n random variables $x_1, x_2, \dots, x_n$, each divided by n. Here $x_1$ is a random variable representing the first observed value in the sample, $x_2$ the second observed value, and so on. Now, when the population is infinitely large, whatever the value of $x_1$, the distribution of $x_2$ is not affected by it. This is true of any other pair of random variables as well. In other words, $x_1, x_2, \dots, x_n$ are independent random variables, all picked from the same population.
Therefore $E(x_i) = \mu$ and $Var(x_i) = \sigma^2$ for each $i = 1, 2, \dots, n$.
Finally,

$$E(\bar{x}) = E\left(\frac{x_1 + x_2 + \dots + x_n}{n}\right) = \frac{1}{n}E(x_1) + \frac{1}{n}E(x_2) + \dots + \frac{1}{n}E(x_n) = \frac{1}{n}\mu + \frac{1}{n}\mu + \dots + \frac{1}{n}\mu = \mu$$
This means that the expected value of the sample mean is the same as the population mean.
and, because the $x_i$ are independent, the variances add:

$$Var(\bar{x}) = Var\left(\frac{x_1 + x_2 + \dots + x_n}{n}\right) = Var\left(\frac{x_1}{n}\right) + Var\left(\frac{x_2}{n}\right) + \dots + Var\left(\frac{x_n}{n}\right)$$

$$= \frac{1}{n^2}Var(x_1) + \frac{1}{n^2}Var(x_2) + \dots + \frac{1}{n^2}Var(x_n) = \frac{1}{n^2}\sigma^2 + \frac{1}{n^2}\sigma^2 + \dots + \frac{1}{n^2}\sigma^2 = \frac{\sigma^2}{n}$$
This says that the variance of the sample mean is the variance of the population divided by the sample size. If we take a large number of samples of size n, then the average value of the sample means tends to be close to the true population mean. On the other hand, if the sample size is increased, the variance of $\bar{x}$ is reduced; by selecting an appropriately large value of n, the variance of $\bar{x}$ can be made as small as desired.
The standard deviation of 𝑥𝑥̅ is also called the standard error of the mean. Very often we estimate
the population mean by the sample mean. The standard error of the mean indicates the extent to
which the observed value of sample mean can be away from the true value, due to sampling errors.
For example, if the standard error of the mean is small, we are reasonably confident that
whatever sample mean value we have observed cannot be very far away from the true value. The
standard error of the mean is represented by 𝜎𝜎𝑥𝑥̅.
Sampling with Replacement
The above results have been obtained under the assumption that the random variables $x_1, x_2, \dots, x_n$ are independent. This assumption is valid when the population is infinitely large. It is also valid when the sampling is done with replacement, so that the population is back to the same form before the next sample member is picked.

Hence, if the sampling is done with replacement, we would again have

$$E(\bar{x}) = \mu \quad \text{and} \quad Var(\bar{x}) = \frac{\sigma^2}{n}, \quad \text{meaning thereby that} \quad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
Sampling Without Replacement from Finite Populations
When a sample is picked without replacement from a finite population, the probability distribution of the second random variable depends on the outcome of the first pick, and so on. As the n random variables representing the n sample members do not remain independent, the expression for the variance of $\bar{x}$ changes. The derivation for this situation works out as under:

$$E(\bar{x}) = \mu \quad \text{and} \quad Var(\bar{x}) = \sigma_{\bar{x}}^2 = \frac{\sigma^2}{n} \cdot \frac{N-n}{N-1}, \quad \text{meaning thereby that} \quad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \cdot \sqrt{\frac{N-n}{N-1}}$$

By comparing these expressions with the ones derived above, we find that the standard error of $\bar{x}$ is the same as before, but multiplied by a factor $\sqrt{(N-n)/(N-1)}$. This factor is, therefore, known as the finite population multiplier.
In practice, almost all the samples used are picked without replacement. Also, most populations are finite, although they may be very large, and so the standard error of the mean should theoretically be found by using the expression given above. However, if the population size (N) is large and consequently the sampling ratio (n/N) small, then the finite population multiplier is close to 1 and is not used, thus treating large finite populations as if they were infinitely large. For example, if N = 5,00,000 and n = 500, the finite population multiplier is

$$\sqrt{\frac{N-n}{N-1}} = \sqrt{\frac{500000-500}{500000-1}} = \sqrt{\frac{499500}{499999}} = \sqrt{0.999002} = 0.9995$$

which is very close to 1, and the standard error of the mean would, for all practical purposes, be the same whether the population is treated as finite or infinite. As a rule of thumb, the finite population multiplier need not be used if the sampling ratio (n/N) is smaller than 0.05.
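A small helper, sketched in Python, makes the rule concrete (the function name is ours, not a standard one):

```python
import math

def standard_error(sigma, n, N=None):
    """Standard error of the mean; applies the finite population multiplier when N is given."""
    se = sigma / math.sqrt(n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))   # finite population multiplier
    return se

# The example from the text: N = 5,00,000 and n = 500 (sampling ratio 0.001 < 0.05).
print(math.sqrt((500000 - 500) / (500000 - 1)))                    # 0.9995..., very close to 1
print(standard_error(10, 500), standard_error(10, 500, N=500000))  # practically identical
```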
Sampling from Normal Populations
It has been observed that the normal distribution occurs very frequently among many natural
phenomena. For example, heights or weights of individuals, the weights of filled-bags from an
automatic machine, the hardness obtained by heat treatment, etc. are distributed normally.
It is also a known fact that the sum of two independent random variables will follow a normal distribution if each of the two random variables belongs to a normal population. The sample mean, as we have seen earlier, is the sum of the n random variables $x_1, x_2, \dots, x_n$, each divided by n. Now, if each of these random variables is from the same normal population, it is not difficult to see that $\bar{x}$ would also be distributed normally.
Let $x \sim N(\mu, \sigma^2)$ symbolically represent the fact that the random variable x is distributed normally with mean $\mu$ and variance $\sigma^2$. Thus,

if $x \sim N(\mu, \sigma^2)$, then it follows that $\bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$.
The normal distribution is a continuous distribution and so the population cannot be small and finite
if it is distributed normally; that is why the finite population multiplier is not used in the above
expression. Let’s see, by an example, how to make use of the above result.
Suppose the weight of candy produced on a semi-automatic machine is known to be distributed
normally with a mean of 10 mg and a standard deviation of 0.1 mg. If we pick up a random sample
of size 5, what is the probability that the sample mean will be between 9.95 mg and 10.05 mg?
Let x be a random variable representing the weight of one candy picked at random. We know that $x \sim N(10, 0.01)$. Therefore, it follows that

$$\bar{x} \sim N\left(10, \frac{0.01}{5}\right)$$

This denotes that $\bar{x}$ will be distributed normally with a mean of 10 and a variance which is only 1/5 of the variance of the population, since the sample size is 5.
$$\Pr\{9.95 \le \bar{x} \le 10.05\} = 2 \times \Pr\{10 \le \bar{x} \le 10.05\}$$

$$= 2 \times \Pr\left\{\frac{10 - \mu}{\sigma/\sqrt{n}} \le \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \le \frac{10.05 - \mu}{\sigma/\sqrt{n}}\right\} = 2 \times \Pr\left\{0 \le z \le \frac{10.05 - 10}{0.1/\sqrt{5}}\right\}$$

$$= 2 \times \Pr\{0 \le z \le 1.12\} = 2 \times 0.3686 = 0.7372$$
Figure 6: Distribution of $\bar{x}$; the enclosed area represents the probability that the random variable $\bar{x}$ lies between 9.95 and 10.05.
We first make use of the symmetry of the normal distribution and then calculate the z value by subtracting the mean and dividing by the standard deviation of the normally distributed random variable, viz. $\bar{x}$. The probability of interest is shown as the enclosed area in Figure 6 above.
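The same probability can be checked without tables using Python's standard library (statistics.NormalDist):

```python
from statistics import NormalDist

mu, sigma, n = 10.0, 0.1, 5
xbar_dist = NormalDist(mu, sigma / n ** 0.5)   # x-bar ~ N(10, 0.1/sqrt(5))

p = xbar_dist.cdf(10.05) - xbar_dist.cdf(9.95)
print(round(p, 4))  # ~0.7364; the table-based 0.7372 differs slightly because z was rounded to 1.12
```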
2.2 The Central Limit Theorem
The above parameters are closely related to the parameters of the population distribution, with the relationship being described by the Central Limit Theorem. The Central Limit Theorem essentially states that the mean of the sampling distribution of the mean ($\mu_{\bar{x}}$) equals the mean of the population ($\mu_x$), that the standard error of the mean ($\sigma_{\bar{x}}$) equals the standard deviation of the population ($\sigma_x$) divided by the square root of the sample size, and that, as the sample size grows infinitely large ($N \to \infty$), the sampling distribution of the mean approaches a normal distribution. These relationships may be summarized as follows:

$$\mu_{\bar{x}} = \mu_x \quad \text{and} \quad \sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{N}}$$

In theory, the sample size needs to be infinitely large for these relationships to hold exactly; in practice, an infinite sample size is impossible. In most situations encountered by researchers, however, the Central Limit Theorem works reasonably well with an N greater than 10 or 20. Thus, it is possible to closely approximate what the distribution of sample means looks like, even with relatively small sample sizes.
The importance of the Central Limit Theorem to statistical thinking cannot be overstated. Most of hypothesis testing and sampling theory is based on this theorem. In addition, it provides a justification for using the normal curve as a model for many naturally occurring phenomena. If a trait, such as intelligence, can be thought of as a combination of relatively independent events, in this case both genetic and environmental, then it would be expected that the trait would be normally distributed in a population.
We need to use the Central Limit Theorem when the population distribution is either unknown or known to be non-normal. If the population distribution is known to be normal, then $\bar{x}$ will also be distributed normally, irrespective of the sample size.
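A short simulation sketch illustrates the theorem for a deliberately non-normal population (an exponential distribution with mean 1, chosen purely for illustration):

```python
import random
import statistics

random.seed(0)
n, reps = 30, 10000
# Exponential population: strongly skewed, with mean = 1 and sigma = 1.
means = [statistics.mean(random.expovariate(1.0) for _ in range(n)) for _ in range(reps)]

print(statistics.mean(means))   # ~1.0, the population mean
print(statistics.stdev(means))  # ~1/sqrt(30) = 0.18, the standard error of the mean
# A histogram of `means` already looks close to a bell curve at n = 30.
```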
2.3 The Sampling Distribution of the Variance
Before attempting to discuss the sampling distribution of the variance, it is worthwhile to first
introduce the concept of sample variance and then present the chi-square distribution which helps
us in working out probabilities for the sample variance, when the population is distributed normally.
The Sample Variance
In probability theory and statistics, the variance is a measure of how far a set of numbers is spread
out. A variance of zero indicates that all the values are identical. A non-zero variance is always
positive: a small variance indicates that the data points tend to be very close to the mean (expected
value) and hence to each other, while a high variance indicates that the data points are very spread
out from the mean and from each other.
We use the sample mean to estimate the population mean when that parameter is unknown. Similarly, we use a sample statistic called the sample variance to estimate the population variance. The sample variance is usually denoted by $s^2$, and it again captures a kind of average of the squared deviations of the sample values from the sample mean. In equation form,

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$$

By comparing this expression with the corresponding expression for the population variance, we notice two differences: the deviations are measured from the sample mean and not from the population mean, and the sum of squared deviations is divided by (n-1) and not by n. Consequently, we can calculate the sample variance based only on the sample values, without knowing the value of any population parameter. The division by (n-1) is for a technical reason: it makes the expected value of $s^2$ equal to $\sigma^2$, which it is supposed to estimate.
2.4 The Chi-square Distribution
The $\chi^2$ distribution is an asymmetric distribution that has a minimum value of 0 but no maximum value. The curve reaches a peak to the right of 0 and then gradually declines in height the larger the $\chi^2$ value is. The curve approaches, but never quite touches, the horizontal axis. For each number of degrees of freedom there is a different $\chi^2$ distribution. The mean of the chi-square distribution equals its degrees of freedom, and its variance is twice the degrees of freedom. This implies that the $\chi^2$ distribution is more spread out, with a peak farther to the right, for larger than for smaller degrees of freedom. As a result, for any given level of significance, the critical region begins at a larger chi-square value the larger the degrees of freedom.
In its graphical representation, the $\chi^2$ value is on the horizontal axis, with the probability density for each $\chi^2$ value represented on the vertical axis. The three curves in the diagram represent the pattern of the chi-square distribution for 1, 5 and 10 degrees of freedom respectively (Figure 7: Chi-square distribution with different degrees of freedom).
If the random variable x has the standard normal distribution, what would be the distribution of $x^2$? Intuitively speaking, it would be quite different from a normal distribution, because $x^2$, being a squared term, can assume only non-negative values. The probability density of $x^2$ will be highest near 0, because most of the values are close to 0 in a standard normal distribution. This distribution is called the chi-square distribution with 1 degree of freedom.
The chi-square distribution has only one parameter, viz. the degrees of freedom, and so there are many chi-square distributions, each with its own degrees of freedom. In statistical tables, chi-square values for different areas under the right tail and the left tail of various chi-square distributions are tabulated.
If $x_1, x_2, \dots, x_n$ are independent random variables, each having a standard normal distribution, then $x_1^2 + x_2^2 + \dots + x_n^2$ will have a chi-square distribution with n degrees of freedom.
If $y_1$ and $y_2$ are independent random variables having chi-square distributions with $\gamma_1$ and $\gamma_2$ degrees of freedom, then $(y_1 + y_2)$ will have a chi-square distribution with $\gamma_1 + \gamma_2$ degrees of freedom. Further, if $y_1$ and $y_2$ are independent random variables such that $y_1$ has a chi-square distribution with $\gamma_1$ degrees of freedom and $(y_1 + y_2)$ has a chi-square distribution with $\gamma > \gamma_1$ degrees of freedom, then $y_2$ will have a chi-square distribution with $(\gamma - \gamma_1)$ degrees of freedom.
Now, if $x_1, x_2, \dots, x_n$ are n random variables from a normal population with mean $\mu$ and variance $\sigma^2$, i.e. $x_i \sim N(\mu, \sigma^2)$, $i = 1, 2, \dots, n$, it implies that

$$\frac{x_i - \mu}{\sigma} \sim N(0, 1)$$

and so $\left(\frac{x_i - \mu}{\sigma}\right)^2$ will have a chi-square distribution with 1 degree of freedom. Hence, $\sum_{i=1}^{n}\left(\frac{x_i - \mu}{\sigma}\right)^2$ will have a chi-square distribution with n degrees of freedom.
We can break up this expression by measuring the deviations from $\bar{x}$ in place of $\mu$. We then have

$$\sum_{i=1}^{n}\left(\frac{x_i - \mu}{\sigma}\right)^2 = \frac{1}{\sigma^2}\sum_{i=1}^{n}\left[(x_i - \bar{x}) + (\bar{x} - \mu)\right]^2$$

$$= \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \bar{x})^2 + \frac{1}{\sigma^2}\sum_{i=1}^{n}(\bar{x} - \mu)^2 + \frac{2(\bar{x} - \mu)}{\sigma^2}\sum_{i=1}^{n}(x_i - \bar{x})$$

$$= \frac{(n-1)s^2}{\sigma^2} + \left(\frac{\bar{x} - \mu}{\sigma/\sqrt{n}}\right)^2 \quad \text{since } \sum_{i=1}^{n}(x_i - \bar{x}) = 0$$
Now, it is known that the LHS of the above equation is a random variable which has a chi-square distribution with n degrees of freedom. It is also known that

$$\bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right), \quad \text{so} \quad \left(\frac{\bar{x} - \mu}{\sigma/\sqrt{n}}\right)^2$$

has a chi-square distribution with 1 degree of freedom. Hence, if the two terms on the right-hand side of the above equation are independent (which will be assumed as true here), it follows that $\frac{(n-1)s^2}{\sigma^2}$ has a chi-square distribution with (n-1) degrees of freedom. One degree of freedom is lost because the deviations are measured from $\bar{x}$ and not from $\mu$.
Expected Value and Variance of $s^2$
The mean of a chi-square distribution is equal to its degrees of freedom, and its variance is equal to twice the degrees of freedom. This can be used to find the expected value and the variance of $s^2$. Since $\frac{(n-1)s^2}{\sigma^2}$ has a chi-square distribution with (n-1) degrees of freedom,

$$E\left[\frac{(n-1)s^2}{\sigma^2}\right] = n-1 \quad \text{or} \quad \frac{(n-1)}{\sigma^2} \cdot E(s^2) = n-1 \quad \therefore \quad E(s^2) = \sigma^2$$

Also,

$$Var\left[\frac{(n-1)s^2}{\sigma^2}\right] = 2(n-1)$$

Using the definition of variance, we get

$$E\left[\frac{(n-1)s^2}{\sigma^2} - E\left(\frac{(n-1)s^2}{\sigma^2}\right)\right]^2 = 2(n-1) \quad \text{or} \quad E\left[\frac{(n-1)s^2}{\sigma^2} - (n-1)\right]^2 = 2(n-1)$$

$$\text{or} \quad \frac{(n-1)^2}{\sigma^4}\, E(s^2 - \sigma^2)^2 = 2(n-1) \quad \therefore \quad E(s^2 - \sigma^2)^2 = \frac{2\sigma^4}{n-1}$$

i.e. $Var(s^2) = \frac{2\sigma^4}{n-1}$, since the expected value of $s^2$ is equal to $\sigma^2$.
It can, therefore, be concluded that if we take a large number of samples, each with a sample size of n, from a normal population with mean $\mu$ and variance $\sigma^2$, each sample will perhaps have a different value for its sample variance $s^2$, but the average of a large number of values of $s^2$ will be close to $\sigma^2$. Also, the variance of $s^2$ falls as the sample size increases. It is important to note here that all the above conclusions are based on the assumption that the population is distributed normally. If the population does not have a normal distribution, then nothing can be said about the distribution of $s^2$.
2.5 Sampling Distribution of the Proportion
Let us assume that 0.80 of all students in a school can pass a test of physical fitness. A random sample of 20 students is chosen: 13 passed and 7 failed. The parameter $\pi$ is used to designate the proportion of subjects in the population that pass (0.80 in this case), and the statistic p is used to designate the proportion who pass in a sample (13/20 = 0.65 in this case). The sample size (N) in this example is 20. If repeated samples of size N were taken from the population and the proportion passing (p) were determined for each sample, a distribution of values of p would be formed. If the sampling went on forever, the distribution would be the sampling distribution of a proportion. The sampling distribution of a proportion is given by the binomial distribution, whose mean and standard deviation are:

$$\mu = \pi \quad \text{and} \quad \sigma_p = \sqrt{\frac{\pi(1-\pi)}{N}}$$

For the present example, N = 20 and $\pi$ = 0.80, so the mean of the sampling distribution of p ($\mu$) is 0.8 and the standard error of p ($\sigma_p$) is 0.089. The shape of the binomial distribution depends on both N and
π. With large values of N and values of π in the neighborhood of 0.5, the sampling distribution is very
close to a normal distribution.
Assume that for the population of people applying for a job at a bank in a major city, 0.40 are able
to pass a basic literacy test required to get the job. Out of a group of 20 applicants, what is the
probability that 50% or more of them will pass? This problem involves the sampling distribution of p
with π = 0.40 and N = 20. The mean of the sampling distribution is π = 0.40. The standard deviation
is:

$$\sigma_p = \sqrt{\frac{\pi(1-\pi)}{N}} = \sqrt{\frac{0.40(1-0.40)}{20}} = 0.11$$
Using the normal approximation, a proportion of 0.50 is: (0.50-0.40)/0.11 = 0.909 standard
deviations above the mean. From a z table it can be calculated that 0.818 of the area is below a z of
0.909. Therefore the probability that 50% or more will pass the literacy test is only about 1 - 0.818 =
0.182.
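The normal figure can be compared with the exact binomial probability using the Python standard library (a continuity correction would bring the approximation closer to the exact value):

```python
from math import comb
from statistics import NormalDist

N, pi = 20, 0.40

# Exact: P(10 or more of the 20 applicants pass) under Binomial(20, 0.40).
exact = sum(comb(N, k) * pi**k * (1 - pi)**(N - k) for k in range(10, N + 1))

# Normal approximation used above: z = (0.50 - 0.40) / sigma_p.
sigma_p = (pi * (1 - pi) / N) ** 0.5                  # ~0.11
approx = 1 - NormalDist().cdf((0.50 - pi) / sigma_p)

print(round(exact, 3), round(approx, 3))  # ~0.245 exact vs ~0.181 approximate: N = 20 is small
```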
2.6 The Confidence Level
The sample mean is the researcher's estimate of the population mean. If we are asked to give an interval
as our estimate, then we would add a range on the upper and the lower side of the sample mean
and give that interval as our estimate. The larger the interval, the greater is our confidence that the
interval does contain the true population mean. It is to be noted that the true population mean is a
constant and is not a variable. On the other hand, the interval that we specify is a random interval
whose position depends on the sample mean. For example if the sample mean is 50 and the standard
error of the mean is 5, we may specify our interval estimate as (45,55) i.e. from 45 to 55 which spans
one standard error of the mean on either side of the sample mean. On the other hand, if the interval
estimate is specified as (40,60) i.e. spanning two standard errors of the mean on either side of the
sample mean, we are more confident that the latter interval contains the true population mean as
compared to the former. However, if the confidence level is raised too high, the corresponding
interval may become too wide to be of any practical use.
The confidence level, therefore, may be defined as the probability that the interval estimate will
contain the true value of the population parameter that is being estimated. If we say that a 95%
confidence interval for the population mean is obtained by spanning 1.96 times the standard error
of the mean on either side of the sample mean, we mean that if we take a large number of samples of size n, say 1000, and obtain the interval estimates from each of these 1000 samples, then about 95% of these interval estimates would contain the true population mean.
Confidence Interval for the Population Mean
Let us now discuss how to obtain a confidence interval for the population mean. We shall assume
that the population distribution is normal and that the population variance is known. Later, we shall
relax the second condition.
Suppose it is known that the weight of cement in packed bags is distributed normally with a standard
deviation of 0.2 Kg. A sample of 25 bags is picked up at random and the mean weight of cement in
these 25 bags is only 49.7 Kg. We want to find a 90% confidence interval for the mean weight of
cement in filled bags.
Let x be a random variable representing the weight of cement in a bag picked up at random. We
know that x is distributed normally with a standard deviation of 0.2 Kg.
The standard error of the mean can be easily calculated as

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{0.2}{\sqrt{25}} = 0.04 \text{ Kg}$$
We can use the above approach when the population standard deviation is known, or when the sample size is large (n > 30), in which case the sample standard deviation can be used as an estimate of the population standard deviation. However, if the sample size is not large, as in the example above, then one has to use the t-distribution in place of the standard normal distribution to calculate the probabilities. Let us assume that we are interested in developing a 90% confidence interval in the same situation as described earlier, with the difference that the population standard deviation is now not known. However, the sample standard deviation has been calculated and is known to be 0.2 Kg.
Since the sample size is n = 25, we know that $\frac{\bar{x} - \mu}{s/\sqrt{n}}$ follows a t-distribution with 24 degrees of freedom. From the t-tables, we can see that the probability that a t-statistic with 24 degrees of freedom lies between -1.711 and +1.711 is 0.90, i.e. the probability that $\bar{x}$ lies between $\mu - 1.711\, s/\sqrt{n}$ and $\mu + 1.711\, s/\sqrt{n}$ is 0.90. In other words, if we use an interval spanning from $(\bar{x} - 1.711\, s/\sqrt{n})$ to $(\bar{x} + 1.711\, s/\sqrt{n})$, then 90% of the time this interval will contain $\mu$. Hence, for a 90% confidence interval,

$$\text{the lower limit} = \bar{x} - 1.711 \frac{s}{\sqrt{n}} = 49.7 - 1.711 \times \frac{0.2}{\sqrt{25}} = 49.6316$$

$$\text{and the upper limit} = \bar{x} + 1.711 \frac{s}{\sqrt{n}} = 49.7 + 1.711 \times \frac{0.2}{\sqrt{25}} = 49.7684$$
In this case, we can state with 90% confidence level that the mean weight of cement in a filled bag
lies between 49.6316 Kg and 49.7684 Kg.
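The arithmetic of this interval can be reproduced in a few lines of Python (the critical value 1.711 is taken from the t-table, as in the text):

```python
from math import sqrt

xbar, s, n = 49.7, 0.2, 25
t_crit = 1.711                     # t value for 24 degrees of freedom at 90% confidence

half_width = t_crit * s / sqrt(n)  # 1.711 * 0.2 / 5 = 0.06844
print(xbar - half_width, xbar + half_width)  # 49.63156 and 49.76844 Kg
```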
Using these derivations and relations, we can also calculate the sample size that would be ideal for a particular study at a desired confidence level.
***
Bibliography
1. http://www.nku.edu/~statistics/212_Sampling_Distribution_of_P-hat.htm
2. http://en.wikipedia.org/wiki/Sampling_distribution
3. http://en.wikipedia.org/wiki/Sampling_(statistics)
4. http://onlinestatbook.com
5. Course material on ‘Quantitative analysis for Managerial Applications’, MS-8, 1997, IGNOU,
Maidan Garhi, New Delhi.
6. Course material on ‘Research Methodology for Management Decisions’, MS-95, 1997, IGNOU,
Maidan Garhi, New Delhi.
7. http://stattrek.com/sampling/sampling-distribution.aspx
8. http://www.psychstat.missouristate.edu/introbook/sbk19.htm
9. http://www.stat.berkeley.edu/~stark/SticiGui/Text/index.htm
10. http://www.fao.org/docrep/w7295e/w7295e08.htm#6
***