# Sampling - What, Why and How

Sampling, what, why and how. Complete Guide to Statistical Sampling.


### Transcript

• 1. Sampling. Date: 12/02/2013. Author: K. S. Alok Ranjan. About: Meaning, Types and Formulas of Sampling in Statistics.
• 2. Sampling, What, Why and How Feb-2013

Contact: ask@sevensolutions.in | www.sevensolutions.in | +91 9810 77 5457

#### I. Sampling – Meaning and Need

Sampling, in simple terms, means choosing a few individual entities from a complete group of entities in order to assess the characteristics or qualities of the complete group. For example:

a. Choosing some individuals from a city for a poll on whether the complete city will vote or not.
b. Choosing some iron rods from a manufacturing plant to assess whether the complete production meets a certain quality standard.
c. Choosing some patients of a particular disease to find out whether every patient suffering from that disease has a particular symptom, or how they will react to a particular medicine.

Sampling is a statistical survey methodology that selects a subset of individuals from a Population in order to estimate characteristics of the complete Population. In the above examples:

• Populations are the complete city of people, the complete production of the manufacturing plant and all the people suffering from the disease.
• Subsets are some individuals in the city, some iron rods from the plant and some patients suffering from the same disease.
• Characteristics to find are voting possibility, certain quality standards and disease symptoms/medicine reaction.

#### II. Stages in Sampling

1. Defining Population: Clearly define who the complete Population is. This reduces the possibility of bias and helps ensure that the right Sample is taken. For e.g. all persons suffering from the disease in the above example.
2. Deciding Sampling Frame: A Sampling Frame is a set of elements of the Population that can be used to extract Samples. For e.g. the contact information of individuals in the poll example above.
3. Deciding Sampling Method: There are different kinds of Sampling Methods, which we will study below.
4. Determining the Sample Size: The size or volume of the Sample can be statistically determined using certain formulas, which we will also study below.
5. Planning the Sampling Implementation: Devise a strategy for how to go about collecting the Samples.
6. Collecting the Data: Collate data from the Samples on the basis of the characteristics decided earlier.

#### III. Population, Subpopulation, Frame and Sample

A Population in Statistics means a set of entities (an entity is an identifiable individual that can be studied alone for any purpose) that are bound by some common measurable characteristics. Generally they are found in a group. For e.g. all students in Delhi who are in any DU college.

A Subpopulation is a subset within the Population that inherits the characteristics of the Population and also has some unique characteristics that are not present in the other, distinct Subpopulations inside the Population. For e.g. each college under DU is a Subpopulation, or all males and all females are two Subpopulations.

A Frame is a mechanism that helps in picking a Sample from a Population. Note that there has to be an instrument that helps in contacting the Samples and including them in the survey. This instrument can be a telephone directory, a University magazine, a patient list etc. So, simply put, a Frame is a list of the Population (preferably the complete Population) that also provides a medium to help pick Samples. For e.g. the Enrolment forms of an academic year.

A Sample is a subset of the Population, chosen using the Frame, so that it can be studied for certain characteristics that can later be generalized to the Population. For e.g. a few of the students from any college of DU, selected from their Enrolment forms.

#### IV. Types of Sampling

Sampling falls into two categories, depending on whether every Unit's chance of being selected can be measured.

1. Probability

Probability Sampling means every Unit in the Population has a chance, smaller or larger but valid, of being selected as a Sample, and this chance can be statistically measured. For e.g. if each home and hospital in a city is searched to identify the patients of a particular type, and one patient is then randomly selected from the city, each patient has a valid chance of being selected. The chance may be higher where the person is the only patient in a home, or lower where there is more than one such patient in a hospital. But the valid chance, or probability in statistical language, remains for each patient. This is Probability Sampling.

In case the probability is equal for each Unit in the Population, it is called EPS, Equal Probability of Selection. An example is searching for a patient with a particular disease in only one hospital.

2. Non Probability

Non Probability Sampling methods are the ones in which some Units of the Population either have no valid chance of being selected, or have a chance that cannot be known beforehand. Non Probability Sampling happens when assumptions are used to sample from a Population. For this reason, sampling errors cannot be determined. It also gives rise to bias due to exclusion, meaning that the Population might not be properly estimated from the Sample. An example is visiting only the hospitals in a city, and not the homes, to find the patients of a particular disease.

#### V. Calculating Sample Size

Let's calculate how many calls should be sampled for quality checks in a Customer Service Process that handles 50,000 calls a month.
$$SS = \frac{ss}{1 + \dfrac{ss - 1}{Pop}}$$

Where:

SS = Sample Size to be calculated

$$ss = \frac{Z^2 \times p \times (1 - p)}{C^2}$$

Pop = Population

p = Per cent of the Population that you expect will satisfy (or not satisfy) the criteria for which you are sampling. For e.g. 30% of the Population meets quality standards and 70% does not. This is expressed as a decimal and is generally taken as 50%, or 0.5. Any per cent greater or lesser than this reduces the sample size; 50% (0.5) gives the largest Sample.

Z = z-score of the Confidence Level (if you manually checked the complete Population for the criteria, like the Quality check in our example, how many times out of 100 would the p you took above be correct: 90, 95 or 99 times?). Generally taken at 95%. In the formula, use:

• 1.645 if you are 90% sure
• 1.960 if you are 95% sure
• 2.576 if you are 99% sure

C = Confidence Interval (the margin of error allowed between what is observed in the Sample and what would be observed if, hypothetically, quality were checked for the complete Population). Expressed as a decimal, e.g. 0.03 for a 3% error.

Example:

Population of 50,000 (that is, 50,000 calls in a month)
p = 50% or 0.5 (because I think half of my Population will flunk in quality and half may not, and this way I can assure the highest Sample)
Z = 1.960 (because I am confident that if I do Quality of the whole Population 100 times, 95 times the p above will be correct)
C = 0.03 (because I want to allow 3% of sampling error, that is, p may vary from 47% to 53% but no more or less)

So,

$$ss = \frac{1.960^2 \times 0.5 \times (1 - 0.5)}{0.03^2} = \frac{0.9604}{0.0009} \approx 1067.11$$

And, now,

$$SS = \frac{1067.11}{1 + \dfrac{1067.11 - 1}{50000}} = \frac{1067.11}{1.02132} \approx 1044.83$$
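The calculation above can be checked with a short script. This is a minimal sketch, assuming the usual two-step form (an uncorrected size followed by a finite population correction); the function name `sample_size`, its parameter names and the choice to round up to the next whole call are illustrative assumptions, not part of the original slides.

```python
import math

# z-scores quoted in the text for the three usual confidence levels
Z_SCORES = {90: 1.645, 95: 1.960, 99: 2.576}


def sample_size(population, confidence=95, p=0.5, margin=0.03):
    """Sample size with finite population correction (hypothetical helper)."""
    z = Z_SCORES[confidence]
    # Step 1: size needed before correcting for the finite Population
    ss = (z ** 2) * p * (1 - p) / margin ** 2
    # Step 2: correct for the size of the Population
    corrected = ss / (1 + (ss - 1) / population)
    # Assumption: round up, since a fraction of a call cannot be sampled
    return math.ceil(corrected)


print(sample_size(50_000))  # 50,000 calls, 95% confidence, p = 0.5, 3% margin
```

Calling `sample_size(50_000, p=0.3)` gives a smaller number than the default, which illustrates why p = 0.5 is the conservative choice.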
• 5. Sampling, What, Why and How Feb-2013 [ ] So, Sample size in a month for a 50,000 calls should be 1045.VI. Sampling Methods Method of Sampling signifies the different ways of calculating sample size. This list will generally differ from one Statistician to other or from one Six Sigma expert to another. This is because the interpretation may result in merging two Methods into one or splitting a Method into two. Since the Methods are situational and to be decided strictly as per the kind of Population you are handling and the kind of analysis you are looking it, I have listed here the almost exhaustive list of Sampling Methods that you may choose from. 1. Simple Random Sampling: Simple Random Sampling is a Probability Sampling. It is choosing a sample (a subset of individuals) from a Population (larger set). Each individual is chosen randomly and entirely by chance, such that each individual has the same probability of being chosen at any stage during the sampling process Simply put, it states that once Sample Size is calculated, the number of Samples to be chosen from the Population has to be chosen in such a way that each entity in the Population has equal chance to be chosen. For e.g. if the Enrolment forms are kept in a box and randomly number of forms are chosen as specified by the Sample Size. Advantage: i. Minimizes bias and simplifies analysis. ii. Variance depicted in Sample is almost correct for the Population. Disadvantage: i. Might not reflect the makeup of the Population, like number of boys and girls in all DU colleges. ii. Tiresome and clumsy in case of a large Population. 2. Systematic Sampling: Systematic Sampling is a Probability Sampling. In this, once Sample size is determined, an interval is created and Samples are chosen from the intervals systematically. The procedure is as below: Divide the Population by the Sample size to arrive at k. Start from an entity between 1 and k. Choose each kth entity from the Population starting from initial k. 
Once the end of population is reached, rotate back to the beginning of the Population cyclically. Continue choosing until the Sample size is reached. For e.g. from a Population of 300, if Sample size is 12, choose every (300/12) th = 25th entity starting from any th random number between 1 – 25. Choose each successive 25 entity from the starting entity until Sample size is reached. However, Population will be very rarely divided by the Sample evenly. For e.g. if Sample size is 9, then (300/9) = 33.33. In these cases, chose a starting point between 1 and 33.33 and round up each successive entity to one up. For e.g. if you start from 23.6, then start selecting 24, 57, 91… and so on. 5 ask@sevensolutions.in | www.sevensolutions.in | +91 9810 77 5457
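The procedure above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `systematic_sample`, its parameters and the use of `random.uniform` to draw the fractional starting point are assumptions made for the example.

```python
import math
import random


def systematic_sample(population_size, sample_size, start=None):
    """Return the 1-based positions chosen by Systematic Sampling."""
    k = population_size / sample_size  # the interval; may be fractional
    if start is None:
        # Assumption: a fractional starting point drawn between 1 and k
        start = random.uniform(1, k)
    positions = []
    for i in range(sample_size):
        # Add k each time and round up to the next whole entity
        position = math.ceil(start + i * k)
        # Rotate back to the beginning if the end of the Population is passed
        position = (position - 1) % population_size + 1
        positions.append(position)
    return positions


# Population of 300, Sample size 9, starting from 23.6 as in the text
print(systematic_sample(300, 9, start=23.6)[:3])
# Population of 300, Sample size 12, starting from entity 7 (k = 25)
print(systematic_sample(300, 12, start=7))
```

With `start=23.6` the first three positions are 24, 57 and 91, matching the example in the text.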
Advantage:
i. Efficient for Databases.
ii. Very efficient for Data with a gradual trend and slope.

Disadvantage:
i. Data with periodicity will be heavily biased. If the Sample frame has alternating boys' and girls' names, Systematic Sampling will choose either all boys or all girls.
ii. Variations between neighbouring entities are never captured.

3. Stratified Sampling: A Population may have different Subpopulations that are independent homogeneous groups which contribute to the characteristics of the Population, but have a unique set of characteristics of their own. The Subpopulations are homogeneous internally but heterogeneous with each other. In Stratified Sampling, the Population is divided into Strata or Subpopulations as per the uniqueness of each Stratum. Care is to be taken that no entity is in more than one Stratum and that no entity of the Population is left out. Then, in each Subpopulation or Stratum, Simple Random Sampling or Systematic Sampling is applied. For e.g. if all the students in all colleges of DU are the Population, the Strata can be each college, or each academic area of the colleges combined (Science, Commerce), or the geographical origin of the students (North India, East India).

While doing a Stratified Sampling, care is to be taken that the proportion of each Stratum in the Population is reflected in the Samples. For e.g. if there are 30% males and 70% females and the Strata are males and females, then a Sample should have 3 males to every 7 females. Also, if a Subpopulation has a larger Standard Deviation, larger Samples should be taken from it than from a Subpopulation with a smaller Standard Deviation. A Population should not be divided into more than six Strata.

Advantage:
i. The Sample represents the Population better and Sampling Error reduces. Subpopulations with more importance can be weighted more.
ii. Different Sampling Methods can be exercised for different Subpopulations.
iii. Sampling from a Population over a wide geographical area is more accurate.

Disadvantage:
i. Cannot be applied to a large Population where the Subgroups may not be distinctly disjoint, or where entities have characteristics that are liable to make them part of more than one Subpopulation.
ii. Scope of Sampling error increases with the number of Subpopulations in a Population.
iii. Can be expensive to implement.

4. Probability Proportional to Size Sampling: If there is more than one Subpopulation, each with a different number of entities, PPS Sampling ensures that the probability of an entity being selected as a Sample increases or decreases in proportion to the size of its Subpopulation. In this case, the Subpopulations are sorted in increasing order of size and each entity is given a running number (numbers for Subp1 = 1 to the number of entities in Subp1, numbers for Subp2 = last number of Subp1 + 1 to last number of Subp1 + number of entities in Subp2, …). Then k (as in Systematic Sampling) is calculated (k = Population/Sample). Then every kth entity is chosen from the Subpopulation numbers arrived at before.

For e.g. in a Population of all students in all colleges of DU, each college is a Subpopulation, and the number of entities (students) varies considerably between them. Suppose we have a Sample Size of 25 to select from 3100 students in DU across 5 colleges:

DU = Population = 3100
Subpopulations = 5 colleges
Sample Size derived = 25

Number calculation for each Subpopulation: first sort and list the Subpopulations in increasing order.

| Colleges | College 1 | College 2 | College 3 | College 4 | College 5 |
|---|---|---|---|---|---|
| Subpopulation | 340 | 510 | 620 | 750 | 880 |
| Numbers | 1 to 340 | 341 to 850 | 851 to 1470 | 1471 to 2220 | 2221 to 3100 |
| Number calc | Entities in least populated Subp | 340+510 | 850+620 | 1470+750 | 2220+880 |

Calculation of k = Population/Sample = 3100/25 = 124

Randomly select the first Sample between 1 and 124, say 113, then 113 + 124 = 237, 237 + 124 = 361… and we get the table below:

| Sample | Number | College |
|---|---|---|
| 1 | 113 | College 1 |
| 2 | 237 | College 1 |
| 3 | 361 | College 2 |
| 4 | 485 | College 2 |
| 5 | 609 | College 2 |
| 6 | 733 | College 2 |
| 7 | 857 | College 3 |
| 8 | 981 | College 3 |
| 9 | 1105 | College 3 |
| 10 | 1229 | College 3 |
| 11 | 1353 | College 3 |
| 12 | 1477 | College 4 |
| 13 | 1601 | College 4 |
| 14 | 1725 | College 4 |
| 15 | 1849 | College 4 |
| 16 | 1973 | College 4 |
| 17 | 2097 | College 4 |
| 18 | 2221 | College 5 |
| 19 | 2345 | College 5 |
| 20 | 2469 | College 5 |
| 21 | 2593 | College 5 |
| 22 | 2717 | College 5 |
| 23 | 2841 | College 5 |
| 24 | 2965 | College 5 |
| 25 | 3089 | College 5 |

The table states that the following number of Samples should be collected from each Subpopulation (College):

| Subpopulation | Sample Size |
|---|---|
| College 1 | 2 |
| College 2 | 4 |
| College 3 | 5 |
| College 4 | 6 |
| College 5 | 8 |
| DU | 25 |

This sums to 25, the Sample Size. As you can see, larger Samples result from Subpopulations with a larger number of entities.
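The tables above can be reproduced with a short script, which may help when the number of Subpopulations is large. This is a sketch under stated assumptions: the function name `pps_systematic` is hypothetical, the starting number 113 is the one picked in the example, and integer division is used for k because 3100/25 divides evenly here.

```python
from bisect import bisect_left
from collections import Counter
from itertools import accumulate


def pps_systematic(sizes, sample_size, start):
    """Pick every kth running number and map it back to its Subpopulation."""
    # Sort the Subpopulations in increasing order of size
    names = sorted(sizes, key=sizes.get)
    # Upper running number of each Subpopulation: 340, 850, 1470, 2220, 3100
    upper_bounds = list(accumulate(sizes[name] for name in names))
    # Assumption: the Population divides evenly by the Sample size
    k = upper_bounds[-1] // sample_size
    numbers = [start + i * k for i in range(sample_size)]
    # bisect_left finds the first Subpopulation whose upper bound covers n
    return [(n, names[bisect_left(upper_bounds, n)]) for n in numbers]


colleges = {
    "College 1": 340,
    "College 2": 510,
    "College 3": 620,
    "College 4": 750,
    "College 5": 880,
}

picks = pps_systematic(colleges, sample_size=25, start=113)
per_college = Counter(college for _, college in picks)

for college, count in per_college.items():
    print(college, count)
print("DU", sum(per_college.values()))
```

The per-college counts printed by the script can be compared line by line with the summary table.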
Advantage:
i. The Sample concentrates on the larger Subpopulations, increasing the representativeness of the Sample.
ii. Counters the disadvantages of Systematic and Stratified Sampling when the Subpopulations have Variance between them.

Disadvantage:
i. Fails to account for negative balances while Sampling a Business' Finance data.
ii. Decreases precision of estimates; thus, requires a larger sample size.

5. Cluster Sampling: Cluster Sampling is a method in which the Population is divided into Clusters, taking care that each Cluster has all the characteristics that the Population as a whole has. Then one or more Clusters are taken as the Sample, leaving the rest of the Clusters untouched.

The difference between Stratified Sampling and Cluster Sampling is that in Stratified Sampling the Sample has to come from each Stratum, whereas in Cluster Sampling the Sample can come from only one or a few Clusters. The other, and basic, difference is that Strata are internally homogeneous but heterogeneous with each other, whereas Clusters are internally heterogeneous but homogeneous to each other. For e.g. the group formed by one student from each college in DU at an inter-college competition would be a Cluster for the Population DU.

Advantage:
i. Cheaper than other Methods.
ii. A Sampling Frame for the complete Population is not needed.

Disadvantage:
i. Sampling error possibility is high. Extra care is needed to choose a Cluster.
ii. Requires a larger Sample than SRS or Systematic Sampling for similar accuracy.

6. Multistage Sampling: Multistage Sampling is a complex form of Cluster Sampling with multiple levels of selection. After identifying the Clusters in a Population, the second stage is to randomly select Samples from each Cluster. Sometimes, when the Population is huge or not completely available, multiple stages of Cluster selection may be applied before the final Sample is collected. For e.g. if the students in all the colleges under DU are the Population, and we need to find out about student involvement in National level Competitions, the first level Clusters would be all the students (from each college under DU) participating in each competition; then, from each Cluster, students can be picked using either Systematic Sampling or SRS.

Advantage:
i. Cost and speed are optimized, and convenience to researchers is assured.
ii. Less Sampling error than normal Cluster Sampling for the same size of Sample.

Disadvantage:
i. Less accurate than SRS for the same Sample Size.
ii. Not much testing and analysis can be done on the Sample.
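The two stages described above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function name `two_stage_sample`, the competition names and the student identifiers are all hypothetical, and SRS is used within each chosen Cluster.

```python
import random


def two_stage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: pick Clusters at random. Stage 2: SRS inside each one."""
    rng = random.Random(seed)
    # Stage 1: choose which Clusters take part
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = {}
    for name in chosen:
        members = clusters[name]
        # Stage 2: Simple Random Sampling within the chosen Cluster
        take = min(n_per_cluster, len(members))
        sample[name] = rng.sample(members, take)
    return sample


# Hypothetical Clusters: students taking part in each of five competitions
clusters = {
    f"Competition {c}": [f"student-{c}-{i}" for i in range(1, 41)]
    for c in "ABCDE"
}

result = two_stage_sample(clusters, n_clusters=2, n_per_cluster=5, seed=7)
for competition, students in result.items():
    print(competition, students)
```

Passing `n_clusters=len(clusters)` samples from every Cluster, which matches the case in the text where students are picked from each Cluster.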
7. Multiphase Sampling: Multiphase Sampling refers to a method where part of the information is collected from the complete main Sample and the rest is collected from a subset of the main Sample. It ensures that some part of the Sample provides more information than the others. Basically, the sub Samples provide more detailed information. For e.g. if all the students in colleges under DU are the Population and we need to find out which students speak fluent Tamil and can teach Basic Stats in Tamil, a large Sample of South Indian students can be separated and asked, "Are you from Tamil Nadu?" Then the sub Sample of students who confirm that they are from Tamil Nadu can be asked if they speak Tamil and can teach Basic Stats.

Advantage:
i. Useful when the Sampling Frame lacks the auxiliary information needed for Stratified Sampling.
ii. Cost effective when budget is not available for collating complete information from the whole Sample.

Disadvantage:
i. The planning and implementation are complicated.
ii. Time consuming.

8. Quota Sampling: Quota Sampling is a Non Probability Sampling. This method segments the Population into mutually exclusive Subpopulations. Then a pre-determined judgement is used to pick Samples from each Subpopulation. For e.g. from the Population of all students in DU colleges, after defining each college as a Subpopulation, 20 female students with entrance exam marks between 75% and 85% are to be chosen. The Researcher may choose any 20 such females from any of the colleges arbitrarily, perhaps based on whether the student's language is easy to understand.

Advantage:
i. Samples have a probability of being biased.
ii. This method is incredibly cheap.

Disadvantage:
i. Limits decisions.
ii. Does not allow variety in the Sample.
iii. Not possible to assess Sampling error as it is not random.

9. Accidental Sampling: Accidental Sampling is like Snowball Sampling in Social Science Research and is also called Convenience/Grab/Opportunity Sampling. It is a Non Probability Sampling and consists of collecting the Sample from the part of the Population that is easy to access or readily available. However, it should be ensured that the research is equipped with fail-safes to lessen the impact of the non-randomness. Also, it should be ensured that there is reason to believe the Convenience Sample represents the Population. For e.g. Sampling the students in a particular gift shop nearest to one of the colleges under DU.
Advantage:
i. Useful for Pilot Testing of a product or service, where the target user is not particular.
ii. Cost effective.

Disadvantage:
i. Sampling error is possible due to the non-randomness of the Sample.
ii. High probability that the Sample does not represent the Population.

10. Line Intercept Sampling: Line Intercept Sampling is a Non Probability Sampling Method generally applied to Samples spread across an area, where the Samples are stationary or relatively immobile, for e.g. patches in a certain habitat type, herbs and vegetation, rocks on a mountain, or relatively less mobile animals like cows grazing in a field. Lines, often called Transects, are drawn through the area, and any entity falling on the line of a Transect is taken as a Sample. If the area is a square or rectangle, the Transect is drawn along a diagonal; if the area has an irregular boundary, more than one Transect is drawn. Generally it is used in Biological studies or in Vegetation and Geographical Data collection.

Advantage:
i. Simple method of Sample collection.
ii. Useful for Populations that are not mobile and cannot participate in the selection process.

Disadvantage:
i. Since it is Non Probability Sampling, some entities do not have a chance of being selected.
ii. Cannot be applied to all kinds of data collection.

11. Panel Sampling: Panel Sampling is mostly used for Social Science Research. It consists of selecting a Sample using any Random Sampling method and then extracting the same information from the Sample more than once over a period of time. The information extracted each time is called a Wave. It is like studying Repeatability for Gage R&R in Six Sigma. For e.g. after selecting the Sample of students from all colleges under DU, asking the students once at the beginning of each year whether they will join the family business or go for a job, and studying the Variance in their answers. It is a very useful Sampling Method; carefully done, it can give useful analysis, using MANOVA, Growth Curve etc., about people and their changing views.

Advantage:
i. Useful for People Study and Political mileage trends.
ii. Can help find out within-person health changes due to changing stress, time, prices etc.

Disadvantage:
i. Time consuming and can prove to be costly.
ii. Cannot be applied to all kinds of data collection.
12. The Judgement Sample: Judgement Sampling is a Non Probability method in which either the Researcher or an Expert makes a Judgement on which entities are to be included in the Sample. Here the Sampling Frame and the Population are not identical, so there is scope for bias. This is appropriate if the Population is difficult to locate, or if some part of the Population is known to have more data or knowledge, or to be more receptive to data sharing, than others. For e.g. from the Population of all students in colleges under DU, if the researcher chooses the ones who are into College Election to ask about current Political events in the country, it would be Judgement Sampling.

Advantage:
i. Easy to determine Samples.
ii. Useful for a Population with definite expertise and skills.

Disadvantage:
i. High scope of bias in the Sample.
ii. Evaluation of the Expert's or Researcher's reliability is necessary.

#### VII. End Notes

Sampling is the first step in Analysis and a very important part of the complete Analysis Process. It forms the primary step in the Measure Phase of Six Sigma. The complete chapter on Sampling above has been presented in as easy a language as possible. However, a few pointers listed below need further study. You can also contact me for any clarification:

1. Systematic Sampling, weighted method
2. Systematic Sampling vs. SRS
3. Standard Deviation while Sampling from Strata
4. Post Stratification and Over Sampling in Stratified Sampling
5. MANOVA, Growth Curve etc.

I will also come back with an Excel-based Sample Size calculator in which you can enter data knowing its significance. Until then, here is a very nice and simple calculator developed by Macorr that you can download and use: http://www.macorr.com/sample-size-calculator.htm [Courtesy: Macorr]