In any sampling problem there is a relevant population, the set of all members about which the study intends to make inferences.
Before we select a sample from a given population, we typically need a list of all members of the population. This list is called the frame, and the potential sample members are called sampling units.
There are two types of samples: probability samples and judgmental samples.
A probability sample is a sample in which the sampling units are chosen from the population by means of a random mechanism such as a random number table.
No formal random mechanism is used to select a judgmental sample. In this case the sampling units are chosen according to the sampler’s judgment.
The simplest type of sampling scheme is appropriately called simple random sampling.
The idea is very simple. We first generate a column of random numbers in column C. Then we sort the rows according to the random numbers and choose the first 10 families in the sorted rows.
The following procedure produces the results.
Random numbers. Enter the formula =RAND() in cell C10 and copy it down column C.
Replace with values. To enable sorting we must “freeze” the random numbers - that is, replace their formulas with values. To do this, select the range C10:C49, use Edit/Copy, and then use Edit/Paste Special with the Values option.
Copy to a new range. Copy the range A10:C49 to the range E10:G49.
Sort. Select the range E10:G49 and use the Data/Sort menu item. Sort according to the Random # column in ascending order. Then the 10 families with the 10 smallest random numbers are the ones in the sample.
Means. Use the AVERAGE, MEDIAN and STDEV functions in row 6 to calculate summary statistics of the first 10 incomes in column F.
If we would like more random samples of size 10, we would need to repeat the whole process each time.
To save you the trouble, we have set up a macro to automate the process. See the Automated sheet of the RANDSAMP.XLS file. Each click of the button produces a different random sample.
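The spreadsheet steps above (attach a random number to each row, sort on it, keep the first 10 rows) can be sketched in Python. This is only an illustration under stated assumptions: the 40-family frame and its income values are made up, and the function name simple_random_sample is hypothetical, not part of the RANDSAMP.XLS file.

```python
import random
import statistics


def simple_random_sample(units, k, rng):
    """Pair each unit with a random key, sort by key, keep the first k units."""
    keyed = [(rng.random(), unit) for unit in units]  # like =RAND() in column C
    keyed.sort(key=lambda pair: pair[0])              # like Data/Sort, ascending
    return [unit for _, unit in keyed[:k]]


# Fixed seed so a rerun gives the same sample (an assumption for repeatability).
rng = random.Random(0)

# Hypothetical frame: 40 (family id, income) pairs standing in for rows 10-49.
frame = [(family_id, 20000 + 1000 * family_id) for family_id in range(1, 41)]

sample = simple_random_sample(frame, 10, rng)
incomes = [income for _, income in sample]

print("mean  :", statistics.mean(incomes))
print("median:", statistics.median(incomes))
print("stdev :", statistics.stdev(incomes))
```

Calling simple_random_sample again with the same rng object plays the role of clicking the macro button: each call yields a different sample.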
Example 8.2 Methods for Selecting Random Samples
This file contains 280 accounts receivable for the Spring Mills Company. There are three variables:
Size: customer size (small, medium, large), depending on its volume of business with Spring Mills
Days: number of days since the customer was billed
Amount: amount of the bill
Generate 50 random samples of size 15 each from the small customers only, calculate the average amount owed in each random sample, and construct a histogram of these 50 averages.
To select small accounts only, insert a blank row after account 150 (the last small account).
Then, with the cursor anywhere in the small account data set, use the StatPro/Statistical Inference/Generate Random Samples menu item, enter 50 and 15 as the number of samples and the sample size, and put the results in a new sheet.
To find the amounts owed for the sampled accounts, enter the formula =VLOOKUP(B3,Data!Data,4) in cell B21 and copy it to the range B21:AY35.
Then calculate the averages in row 37 with the AVERAGE function and transpose this row of averages to a column of averages in BA4:BA53 by entering the formula =TRANSPOSE(B37:AY37) and pressing Ctrl-Shift-Enter.
Use StatPro’s histogram procedure to create a histogram of the averages. Each run will produce a different histogram because different random samples are selected.
The histogram indicates the variability of sample means we might obtain by selecting many different random samples of size 15 from this population of small customer accounts.
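The same experiment (50 random samples of size 15, one average per sample, then a histogram of the averages) can be sketched in Python. The account amounts below are randomly generated placeholders, not the Spring Mills data, and the helper names sample_means and text_histogram are hypothetical.

```python
import random
import statistics


def sample_means(population, num_samples, sample_size, rng):
    """Draw repeated simple random samples and return the mean of each one."""
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(num_samples)
    ]


def text_histogram(values, num_bins=8):
    """Return (bin lower edge, count) pairs for a crude histogram."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins or 1.0  # avoid zero width if all values match
    counts = [0] * num_bins
    for value in values:
        index = min(int((value - low) / width), num_bins - 1)
        counts[index] += 1
    return [(low + i * width, count) for i, count in enumerate(counts)]


rng = random.Random(1)  # fixed seed, an assumption for repeatability

# Hypothetical stand-in for the amounts owed by the 150 small accounts.
small_amounts = [round(rng.uniform(50, 500), 2) for _ in range(150)]

means = sample_means(small_amounts, 50, 15, rng)
bins = text_histogram(means)
for lower_edge, count in bins:
    print(f"{lower_edge:8.2f} | {'*' * count}")
```

Changing the seed changes the histogram, which mirrors the point that each set of random samples gives a different picture of the same underlying variability.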
Example 8.3 Methods for Selecting Random Samples
This file contains a frame of all 1000 people in the city of Smalltown who have Sears credit cards.
Sears is interested in estimating the average number of other credit cards these people own, as well as other information about their use of credit.
The company decides to stratify these customers by age, select a stratified sample of size 100 with proportional sample sizes, and then contact these 100 people by phone.
Suppose we can identify various subpopulations within the total population. We call these subpopulations strata.
It then makes sense to select a simple random sample from each stratum instead of from the entire population. This is called stratified sampling.
This method is particularly useful when there is considerable variation between the various strata but relatively little variation within a given stratum.
To obtain a stratified random sample we must choose a total sample size n, and we must choose a sample size n_i for each stratum i.
There are many ways to choose these numbers, but the most popular method is proportional sample sizes, where each n_i is proportional to the size of stratum i.
The advantage of proportional sample sizes is that they are very easy to determine. The disadvantage is that they ignore differences in variability among the strata.
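As a small illustration of proportional sample sizes, the sketch below computes n_i = n * N_i / N for each stratum and rounds to a whole number. The function name proportional_sizes is hypothetical. The first stratum count of 132 matches the range E11:E142 used later in this example; the other two counts are invented so that the total is 1000.

```python
import math


def proportional_sizes(stratum_counts, total_sample_size):
    """Allocate the total sample size to strata in proportion to their sizes.

    Rounds half up, as Excel's ROUND does for positive numbers. The rounded
    sizes are not guaranteed to add up exactly to total_sample_size.
    """
    population_size = sum(stratum_counts)
    return [
        math.floor(total_sample_size * count / population_size + 0.5)
        for count in stratum_counts
    ]


# 132 is taken from the example; 512 and 356 are hypothetical counts.
counts = [132, 512, 356]
sizes = proportional_sizes(counts, 100)
print(sizes)
```

Note that the allocation uses only the stratum counts. It never looks at how variable each stratum is, which is exactly the disadvantage mentioned above.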
To see what age category each customer falls in we enter the formula =IF(B11<=$D$6,1,IF(B11<=$D$7,2,3)) in cell C11 and then copy it down column C.
Next, it is useful to “unstack” the data into three groups, one for each age category.
It is easy to unstack the data in columns A-C.
With the cursor anywhere in A10:C1010, select StatPro/Data Utilities/Unstack Variables. Select Category as the Code variable, select Cust and Age as the variables to unstack, and accept the default location for the unstacked variables.
Once the variables are unstacked we can calculate the counts and sample sizes in F6:G8 with the formulas =COUNT(E11:E142) and =ROUND(TotSampSize*F6/1000,0).
Finally, we copy the data in columns E and F into columns L and M, append a column of random numbers, sort on the random number column, and choose the first 13 (or however many) customers.
The file shows the calculations for the other categories.
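The whole stratified procedure (assign an age category, unstack by category, compute proportional sample sizes, then sample within each category) can be sketched as follows. The customer ages and the two age cutoffs are hypothetical placeholders for the values in the worksheet, and rng.sample is used in place of the sort-on-random-numbers step, which selects a simple random sample in the same way.

```python
import math
import random


def age_category(age, cutoff1, cutoff2):
    """Mirror of =IF(B11<=$D$6,1,IF(B11<=$D$7,2,3))."""
    if age <= cutoff1:
        return 1
    if age <= cutoff2:
        return 2
    return 3


def stratified_sample(customers, cutoff1, cutoff2, total_sample_size, rng):
    """Return {category: sampled (customer id, age) pairs}, proportional sizes."""
    strata = {1: [], 2: [], 3: []}
    for cust_id, age in customers:  # the "unstack" step
        strata[age_category(age, cutoff1, cutoff2)].append((cust_id, age))

    population_size = len(customers)
    chosen = {}
    for category, members in strata.items():
        # Proportional size, rounded half up like Excel's ROUND(...,0).
        size = math.floor(
            total_sample_size * len(members) / population_size + 0.5
        )
        chosen[category] = rng.sample(members, size)
    return chosen


rng = random.Random(2)  # fixed seed, an assumption for repeatability

# Hypothetical frame of 1000 customers with made-up ages.
customers = [(cust_id, rng.randint(18, 80)) for cust_id in range(1, 1001)]

# Cutoffs of 30 and 55 are assumptions; the worksheet keeps its own in D6:D7.
sample = stratified_sample(customers, 30, 55, 100, rng)
for category, members in sample.items():
    print(category, len(members))
```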
The advantage of cluster sampling is sampling convenience (and possibly less cost).
It is straightforward to select a cluster sample. The key is to define the sampling units as the clusters, then select a simple random sample of clusters. Then sample all the population members in each selected cluster.
When all population members in each selected cluster are sampled, the scheme is called single-stage sampling.
Real applications are often more complex and result in multistage sampling schemes.
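A single-stage cluster sample can be sketched in a few lines: treat the clusters as the sampling units, draw a simple random sample of clusters, and keep every member of each selected cluster. The city-block data and the function name single_stage_cluster_sample are hypothetical, chosen only to make the sketch runnable.

```python
import random


def single_stage_cluster_sample(clusters, num_clusters, rng):
    """Pick num_clusters clusters at random and return all of their members."""
    chosen_ids = rng.sample(sorted(clusters), num_clusters)
    members = [member for cid in chosen_ids for member in clusters[cid]]
    return chosen_ids, members


rng = random.Random(3)  # fixed seed, an assumption for repeatability

# Hypothetical population: 20 city blocks (clusters) of 5 households each.
clusters = {
    block: [f"block{block}-house{house}" for house in range(1, 6)]
    for block in range(1, 21)
}

chosen_ids, members = single_stage_cluster_sample(clusters, 4, rng)
print(chosen_ids)
print(len(members), "households sampled")
```

A multistage scheme would add a further random selection inside each chosen cluster instead of taking every member.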
An internal auditor for a furniture retailer wants to estimate the average of all accounts receivable taken over the population of all customer accounts.
The company has approximately 10,000 accounts. An exhaustive enumeration is impossible.
Therefore, the auditor randomly samples 100 of the accounts. This file contains the observed data.
There are two basic sources of errors that can occur when we sample randomly from a population:
Sampling error results from “unlucky” samples.
Nonsampling errors, which are quite different, can occur for a variety of reasons.
Nonresponse bias occurs when a portion of the sample fails to respond to the survey.
Nontruthful responses are a particular problem when respondents are asked sensitive questions. One solution is a randomized response technique that presents two questions: one sensitive, one innocuous.
Measurement error occurs when the responses to the questions do not reflect what the investigator had in mind.
The receivables for the 100 sampled accounts appear in column E.
We calculate the sample mean and the sample standard deviation. Then we calculate the (approximate) standard error of the mean with the formula =Sstdev/SQRT(SampSize) in cell B9.
The auditor should interpret these values as follows:
The sample mean can be used to estimate the unknown population mean. It provides a best guess for the average of the receivables for the 10,000 accounts.
The standard error provides a measure of accuracy.
The auditor can be approximately 95% certain that the mean of all 10,000 accounts is within the interval $279 ± $84, that is, between $195 and $363.
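The arithmetic behind these statements can be sketched as follows. The standard error of $42 is inferred from the stated margin of $84 (two standard errors); it is not quoted directly in the text. The helper names and the five-value toy sample are hypothetical.

```python
import math
import statistics


def standard_error(sample):
    """Approximate standard error of the mean: sample stdev / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


def two_se_interval(mean, std_err):
    """Interval of mean plus or minus two standard errors (about 95%)."""
    return mean - 2 * std_err, mean + 2 * std_err


# Values from the example: mean $279, standard error $42 (inferred as 84 / 2).
low, high = two_se_interval(279, 42)
print(low, high)

# Toy data, only to show the standard error calculation on a raw sample.
toy_sample = [1, 2, 3, 4, 5]
print(statistics.mean(toy_sample), standard_error(toy_sample))
```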
What does this experiment have to do with random sampling?
Here, the population is the set of all outcomes we could obtain from a single spin of the wheel; that is, all dollar values from $0 to $1000. Each spin results in one randomly sampled dollar value from the population.
Furthermore, because we have assumed that the wheel is equally likely to land in any position, all possible values in the continuum from $0 to $1000 have the same chance of occurring.
In order to analyze the distribution of winnings based on the average of n spins we need to run a sequence of simulations for n = 1, n = 2, n = 3, n = 6, and n = 10.
This spreadsheet contains the simulation for n = 1. The other simulations can be found in the spreadsheets SPIN2.XLS, SPIN3.XLS, SPIN6.XLS, and SPIN10.XLS.
For each simulation we consider 1000 replications of an experiment.
The experiment simulates n spins of the wheel and calculates the average - that is, the winnings - from the n spins.
Based on these 1000 replications, we can then calculate the average winnings, the standard deviation of winnings, and a histogram of winnings for each n. These will show clearly how the distribution of winnings depends on n.
The following slide shows the results for n = 1. Here, there is no averaging.
To replicate the experiment 1000 times and collect statistics, we proceed as follows.
Random outcomes. To generate outcomes uniformly distributed between $0 and $1000 we enter the formula =$B$3+RAND()*($B$4-$B$3) in cell B11 and copy it down column B. The effect of this formula is to generate a random number between 0 and 1 and multiply it by $1000.
Summary measures. Calculate the average and standard deviation of the 1000 winnings in column B with the AVERAGE and STDEV functions. These values appear in cells E4 and E5.
The following slide contains the chart and graph of the n = 2 simulation.
To do this we formed a second column of outcomes in column C corresponding to a second spin in each experiment. We average the values in columns B and C to obtain each of the winnings in column D.
The average winnings is very close to $500, but the standard deviation is much lower and the histogram is no longer flat.
The histogram is now triangular shaped - symmetric, but not yet bell shaped.
To develop similar simulations for n = 3, n = 6, n = 10, or any other n, we insert additional outcome columns and make sure that the AVERAGE formula in the Winnings column averages all n outcomes to its left.
This behavior is exactly as the central limit theorem predicts.
For any population distribution with mean mu and standard deviation sigma, the sampling distribution of the sample mean is approximately normal with mean mu and standard deviation sigma/sqrt(n), and the approximation improves as n increases.
In fact, because the population distribution is symmetric in this example - it’s flat - we see the effect of the theorem for n much less than 30; it is already evident for n as low as 6.
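The wheel experiment can also be sketched outside the spreadsheet. The code below runs 1000 replications for each n and compares the observed standard deviation of winnings with the value the central limit theorem suggests. The function name simulate_winnings and the seed are assumptions; the theoretical value uses the fact that a uniform distribution on $0 to $1000 has standard deviation 1000/sqrt(12).

```python
import math
import random
import statistics


def simulate_winnings(n_spins, replications, rng, low=0.0, high=1000.0):
    """Each replication averages n_spins uniform outcomes between low and high."""
    winnings = []
    for _ in range(replications):
        # Same form as the worksheet formula: low + RAND() * (high - low).
        spins = [low + rng.random() * (high - low) for _ in range(n_spins)]
        winnings.append(sum(spins) / n_spins)
    return winnings


rng = random.Random(4)  # fixed seed, an assumption for repeatability
population_sd = 1000 / math.sqrt(12)  # stdev of a uniform on [0, 1000]

results = {}
for n in (1, 2, 3, 6, 10):
    winnings = simulate_winnings(n, 1000, rng)
    results[n] = (statistics.mean(winnings), statistics.stdev(winnings))
    predicted_sd = population_sd / math.sqrt(n)
    print(
        f"n={n:2d}  mean={results[n][0]:7.1f}  "
        f"sd={results[n][1]:6.1f}  predicted sd={predicted_sd:6.1f}"
    )
```

The averages should stay near $500 for every n while the standard deviations shrink roughly in line with the predicted values, which is the pattern the histograms show.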
A marketing researcher has been hired by a videocassette rental company to estimate the average number of videocassettes rented annually by households in a particular metropolitan area.
The researcher decides to determine the sample size that makes the maximum probable absolute error approximately equal to 10.
The determination of sampling size is usually driven by sampling error considerations.
The usual procedure is to specify an acceptable sampling error B, called the maximum probable absolute error, and then find the required sample size n from the equation n = 4 sigma^2 / B^2, where sigma is the population standard deviation.
The implication is that if we randomly sample this many members from the population, then there is a 95% chance that the resulting sampling error will be no greater than B in magnitude.
To use this equation she must estimate a value of sigma.
Based on her knowledge of the industry and available historical data, she uses a best guess of sigma =50.
She then uses the values from C7 and C8 to find the required sample size in C10 with the formula =4*PopStDev^2/MaxAbsErr^2
Finally, she takes a sample of size 100 and observes the sample values shown in column F. Based on this sample, we calculate summary measures in the usual way in the range C13:C16.
The absolute error in cell C16 is 2 times as great as the standard error in cell C15.
It is slightly higher than the maximum absolute error she specified in cell C8 because she observed a larger standard deviation than she had guessed.
In other words, the fact that there is evidently more variation in the population than she thought makes her sample mean based on 100 households slightly less accurate than she intended.
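The sample size calculation and the after-the-fact check can be sketched as follows. The function names are hypothetical, rounding up to a whole number is an added assumption (the worksheet formula itself does not round), and the observed standard deviation of 52 is an invented value used only to show how a larger-than-guessed standard deviation pushes the error above the target.

```python
import math


def required_sample_size(sigma_guess, max_abs_error):
    """n = 4 * sigma^2 / B^2, rounded up to a whole number of sample members."""
    return math.ceil(4 * sigma_guess ** 2 / max_abs_error ** 2)


def achieved_abs_error(sample_stdev, sample_size):
    """Two standard errors: 2 * s / sqrt(n)."""
    return 2 * sample_stdev / math.sqrt(sample_size)


n = required_sample_size(50, 10)  # sigma guess of 50 and B of 10, as in the text
print(n)

# Hypothetical observed stdev slightly above the guess of 50.
error = achieved_abs_error(52, n)
print(error)
```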