Lecture 4 Applied Econometrics and Economic Modeling


Published on

Applied Economic Modeling

Published in: Education

  1. 1. Methods for Selecting Random Samples
  2. 2. RANDSAMP.XLS <ul><li>This file contains data about the annual incomes of 40 families. </li></ul><ul><li>We want to choose a simple random sample of size 10 from this frame. </li></ul><ul><li>How can this be done? </li></ul><ul><li>And how do summary statistics of the chosen families compare to the corresponding summary statistics of the population? </li></ul>
  3. 3. Data
  4. 4. Sampling Terminology <ul><li>In any sampling problem there is a relevant population, the set of all members about which the study intends to make inferences. </li></ul><ul><li>Before we select a sample from a given population, we typically need a list of all members of the population. This list is called the frame, and the potential sample members are called sampling units. </li></ul><ul><li>There are two types of samples: probability samples and judgmental samples. </li></ul>
  5. 5. Sampling Terminology -- continued <ul><li>A probability sample is a sample in which the sampling units are chosen from the population by means of a random mechanism such as a random number table. </li></ul><ul><li>No formal random mechanism is used to select a judgmental sample; in this case the sampling units are chosen according to the sampler’s judgment. </li></ul><ul><li>The simplest type of sampling scheme is appropriately called simple random sampling. </li></ul>
  6. 6. Solution <ul><li>The idea is very simple. We first generate a column of random numbers in column C. Then we sort the rows according to the random numbers and choose the first 10 families in the sorted rows. </li></ul><ul><li>The following procedure produces the results. </li></ul><ul><ul><li>Random numbers. Enter the formula =RAND() in cell C10 and copy it down column C. </li></ul></ul><ul><ul><li>Replace with values. To enable sorting we must “freeze” the random numbers - that is, replace their formulas with values. To do this, select the range C10:C49, use Edit/Copy, and then use Edit/Paste Special with the Values option. </li></ul></ul>
  7. 7. Solution -- continued <ul><ul><li>Copy to a new range. Copy the range A10:C49 to the range E10:G49. </li></ul></ul><ul><ul><li>Sort. Select the range E10:G49 and use the Data/Sort menu item. Sort according to the Random # column in ascending order. Then the 10 families with the 10 smallest random numbers are the ones in the sample. </li></ul></ul><ul><ul><li>Means. Use the AVERAGE, MEDIAN and STDEV functions in row 6 to calculate summary statistics of the first 10 incomes in column F. </li></ul></ul>
  8. 8. Results
  9. 9. More Random Samples Automatically <ul><li>If we want more random samples of size 10, we need to repeat the whole process for each one. </li></ul><ul><li>To save you the trouble, we have set up a macro to automate the process. See the Automated sheet of the RANDSAMP.XLS file. Each click on the button produces a different random sample. </li></ul>
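The sort-on-random-numbers procedure described above can be sketched outside the spreadsheet. The following Python snippet is only an illustration under stated assumptions: the family incomes are made-up values (not the RANDSAMP.XLS data), and the function name `simple_random_sample` is hypothetical.

```python
import random
import statistics

def simple_random_sample(frame, n, rng=None):
    """Pick n units by attaching a random key to each unit and sorting on it."""
    rng = rng or random.Random()
    keyed = [(rng.random(), unit) for unit in frame]  # plays the role of =RAND()
    keyed.sort(key=lambda pair: pair[0])              # plays the role of Data/Sort
    return [unit for _, unit in keyed[:n]]            # first n rows after sorting

# Hypothetical frame: (family id, annual income) for 40 families.
rng = random.Random(1)
frame = [(fam, rng.randrange(20000, 90000, 100)) for fam in range(1, 41)]

sample = simple_random_sample(frame, 10, rng)
pop_incomes = [income for _, income in frame]
smp_incomes = [income for _, income in sample]

print("population mean:", statistics.mean(pop_incomes))
print("sample mean:    ", statistics.mean(smp_incomes))
print("sample median:  ", statistics.median(smp_incomes))
print("sample stdev:   ", statistics.stdev(smp_incomes))
```

Calling `simple_random_sample` again draws a fresh sample, which is roughly what the macro button does in the workbook.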
  10. 10. Example 8.2 Methods for Selecting Random Samples
  11. 11. RECEIVE.XLS <ul><li>This file contains 280 accounts receivable for the Spring Mills Company. There are three variables: </li></ul><ul><ul><li>Size: customer size (small, medium, large), depending on its volume of business with Spring Mills </li></ul></ul><ul><ul><li>Days: number of days since the customer was billed </li></ul></ul><ul><ul><li>Amount: amount of the bill </li></ul></ul><ul><li>Generate 50 random samples of size 15 each from the small customers only, calculate the average amount owed in each random sample, and construct a histogram of these 50 averages. </li></ul>
  12. 12. Generated Random Sample
  13. 13. Solution <ul><li>To select small accounts only, insert a blank row after account 150 (the last small account). </li></ul><ul><li>Then, with the cursor anywhere in the small account data set, use the StatPro/Statistical Inference/Generate Random Samples menu item, enter 50 and 15 as the number of samples and the sample size, and put the results in a new sheet. </li></ul><ul><li>To find the amounts owed for the sampled accounts, enter the formula =VLOOKUP(B3,Data!Data,4) in cell B21 and copy it to the range B21:AY35. </li></ul>
  14. 14. Solution -- continued <ul><li>Then calculate the averages in row 37 with the AVERAGE function, and transpose this row of averages to a column of averages in BA4:BA53 by entering the formula =TRANSPOSE(B37:AY37) and pressing Ctrl-Shift-Enter. </li></ul><ul><li>Use StatPro’s histogram procedure to create a histogram. Each run will produce a different histogram because different random samples are selected. </li></ul>
  15. 15. Solution -- continued <ul><li>The histogram indicates the variability of sample means we might obtain by selecting many different random samples of size 15 from this population of small customer accounts. </li></ul>
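As a rough illustration of the repeated-sampling exercise above, the sketch below draws 50 samples of size 15 and tabulates their means. The 150 bill amounts are made-up values (not the RECEIVE.XLS data), and `sample_means` is a hypothetical helper name.

```python
import random
import statistics

def sample_means(values, num_samples, sample_size, rng):
    """Return the mean of each of num_samples simple random samples."""
    return [statistics.mean(rng.sample(values, sample_size))
            for _ in range(num_samples)]

rng = random.Random(2)
# Hypothetical bill amounts for 150 small accounts.
amounts = [round(rng.uniform(100, 400), 2) for _ in range(150)]

means = sample_means(amounts, 50, 15, rng)
print("population mean:", round(statistics.mean(amounts), 2))
print("mean of the 50 sample means:", round(statistics.mean(means), 2))
print("stdev of the 50 sample means:", round(statistics.stdev(means), 2))

# Crude text histogram of the sample means, in bins of width 10.
bins = {}
for m in means:
    low = int(m // 10) * 10
    bins[low] = bins.get(low, 0) + 1
for low in sorted(bins):
    print(f"{low:4d}-{low + 10:<4d} {'*' * bins[low]}")
```

The spread of the starred bars plays the same role as the StatPro histogram: it shows how much the sample mean moves from one random sample to the next.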
  16. 16. Example 8.3 Methods for Selecting Random Samples
  17. 17. STRATIFIED.XLS <ul><li>This file contains a frame of all 1000 people in the city of Smalltown who have Sears credit cards. </li></ul><ul><li>Sears is interested in estimating the average number of other credit cards these people own, as well as other information about their use of credit. </li></ul><ul><li>The company decides to stratify these customers by age, select a stratified sample of size 100 with proportional sample sizes, and then contact these 100 people by phone. </li></ul><ul><li>How might Sears proceed? </li></ul>
  18. 18. Systematic Sampling <ul><li>A systematic sample provides a convenient way to choose the sample. </li></ul><ul><li>It works as follows: </li></ul><ul><ul><li>First, we calculate the sampling interval as the population size divided by the sample size. In the example here the interval is 220 and the sample size is 250, which corresponds to a population of 220 x 250 = 55,000 names. </li></ul></ul><ul><ul><li>Next, we use a random mechanism to choose a number between 1 and 220 (say, 131). </li></ul></ul><ul><ul><li>Then we choose the 131st name, the 351st name, the 571st name, and so on. The result is a systematic sample of size n = 250. </li></ul></ul>
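The systematic sampling steps can be sketched in a few lines of Python. This is a minimal sketch, not the author's procedure: the population size of 55,000 is inferred from the interval of 220 and the sample size of 250, and the function name `systematic_sample` is hypothetical.

```python
import random

def systematic_sample(population_size, sample_size, rng=None):
    """Return the 1-based positions chosen by systematic sampling."""
    rng = rng or random.Random()
    interval = population_size // sample_size   # sampling interval
    start = rng.randint(1, interval)            # random start in 1..interval
    return [start + k * interval for k in range(sample_size)]

# Assumed frame of 55,000 names (220 * 250), sample of 250.
positions = systematic_sample(55_000, 250, random.Random(5))
print("interval:", positions[1] - positions[0])
print("first three positions:", positions[:3])
print("sample size:", len(positions))
```

With a random start of 131 the first three positions would be 131, 351, and 571, as in the slide.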
  19. 19. Stratified Sampling <ul><li>Suppose we can identify various subpopulations within the total population. We call these subpopulations strata. </li></ul><ul><li>It then makes sense to select a simple random sample from each stratum instead of from the entire population. This is called stratified sampling. </li></ul><ul><li>This method is particularly useful when there is considerable variation between the various strata but relatively little variation within a given stratum. </li></ul>
  20. 20. Stratified Sampling -- continued <ul><li>To obtain a stratified random sample we must choose a total sample size n, and we must choose a sample size n_i for each stratum i. </li></ul><ul><li>There are many ways to choose these numbers, but the most popular method is proportional sample sizes. </li></ul><ul><li>The advantage of proportional sample sizes is that they are very easy to determine. The disadvantage is that they ignore differences in variability among the strata. </li></ul>
  21. 21. Solution <ul><li>First Sears must decide exactly how to stratify by age. </li></ul><ul><li>Their reasoning is that different age groups probably have different attitudes and behavior regarding credit. </li></ul><ul><li>After preliminary investigation they decide to have three age categories: 18-30, 31-62, and 63-80. </li></ul><ul><li>The spreadsheet is laid out as follows: </li></ul><ul><ul><li>the total sample size is in cell C3 </li></ul></ul><ul><ul><li>the definitions of the strata are in rows 6-8 </li></ul></ul><ul><ul><li>the customer data are in the range A11:B1010 </li></ul></ul>
  22. 22. Stratified Sample
  23. 23. Solution -- continued <ul><li>To see what age category each customer falls in we enter the formula =IF(B11<=$D$6,1,IF(B11<=$D$7,2,3)) in cell C11 and then copy it down column C. </li></ul><ul><li>Next, it is useful to “unstack” the data into three groups, one for each age category. </li></ul><ul><ul><li>It is easy to unstack the data in columns A-C. </li></ul></ul><ul><ul><li>With the cursor anywhere in A10:C1010 select StatPro/Data Utilities/Unstack Variables. Select Category as the Code variable, select Cust and Age as the variables to unstack, and accept the default location for the unstacked variables. </li></ul></ul>
  24. 24. Solution -- continued <ul><li>Once the variables are unstacked we can calculate the counts and sample sizes in F6:G8 with the formulas =COUNT(E11:E142) and =ROUND(TotSampSize*F6/1000,0) . </li></ul><ul><li>Finally, we copy the data in columns E and F into columns L and M, append a column of random numbers, sort on the random number column, and choose the first 13 (or however many are required) customers. </li></ul><ul><li>The file shows the calculations for the other categories. </li></ul>
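The stratified procedure above (categorize by age, count each stratum, allocate proportionally, then sample within each stratum) can be sketched as follows. The customer ages are randomly generated stand-ins for the STRATIFIED.XLS frame, the helper names are hypothetical, and the cutoffs 30 and 62 are taken from the age categories 18-30, 31-62, and 63-80.

```python
import random

def age_category(age, cut1=30, cut2=62):
    """Mirror the nested IF: 1 if age <= cut1, 2 if age <= cut2, otherwise 3."""
    if age <= cut1:
        return 1
    if age <= cut2:
        return 2
    return 3

def proportional_sizes(counts, total_n):
    """Proportional allocation; halves round up, as Excel's ROUND does for positives."""
    population = sum(counts.values())
    return {cat: int(total_n * c / population + 0.5) for cat, c in counts.items()}

def stratified_sample(frame, total_n, rng):
    """Draw a simple random sample of the allocated size from each stratum."""
    strata = {}
    for cust, age in frame:
        strata.setdefault(age_category(age), []).append((cust, age))
    sizes = proportional_sizes({cat: len(m) for cat, m in strata.items()}, total_n)
    return {cat: rng.sample(strata[cat], sizes[cat]) for cat in sorted(strata)}

# Hypothetical frame: 1000 customers with ages between 18 and 80.
rng = random.Random(3)
frame = [(cust, rng.randint(18, 80)) for cust in range(1, 1001)]

sample = stratified_sample(frame, 100, rng)
for cat, members in sample.items():
    print("category", cat, "sample size", len(members))
```

Because each stratum size is rounded separately, the stratum sample sizes can add up to slightly more or less than the intended total; the same is true of the ROUND formula in the spreadsheet.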
  25. 25. Cluster Sampling <ul><li>Suppose a company is interested in various characteristics of households in a particular city. The sampling units are households. </li></ul><ul><li>We could proceed with the sampling methods already discussed, but another approach is more convenient. </li></ul><ul><li>We could divide the city into city blocks, use the blocks as sampling units, and then sample all the households in the chosen blocks. </li></ul><ul><li>In this case the city blocks are called clusters and the sampling is called cluster sampling. </li></ul>
  26. 26. Cluster Sampling -- continued <ul><li>The advantage of cluster sampling is sampling convenience (and possibly less cost). </li></ul><ul><li>It is straightforward to select a cluster sample. The key is to define the sampling units as the clusters, then select a simple random sample of clusters. Then sample all the population members in each selected cluster. </li></ul><ul><li>When all sampling units within each cluster are taken it is called a single stage sampling scheme. </li></ul><ul><li>Real applications are often more complex and result in multistage sampling schemes . </li></ul>
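A single-stage cluster sample as described above might look like the following sketch. The city, its block labels, and the household labels are all invented for illustration, and `cluster_sample` is a hypothetical helper name.

```python
import random

def cluster_sample(clusters, num_clusters, rng):
    """Single-stage cluster sample: pick clusters at random, keep every member."""
    chosen = rng.sample(sorted(clusters), num_clusters)
    return {block: list(clusters[block]) for block in chosen}

# Hypothetical city: 30 blocks, each with 5 to 12 households.
rng = random.Random(4)
city = {
    f"block-{b}": [f"block-{b}/house-{h}" for h in range(rng.randint(5, 12))]
    for b in range(1, 31)
}

sample = cluster_sample(city, 5, rng)
households = [h for members in sample.values() for h in members]
print(len(sample), "blocks selected,", len(households), "households sampled")
```

A multistage scheme would add a further random selection of households inside each chosen block instead of keeping all of them.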
  27. 27. Example 8.4 An Introduction to Estimation
  28. 28. AUDIT.XLS <ul><li>An internal auditor for a furniture retailer wants to estimate the average of all accounts receivable taken over the population of all customer accounts. </li></ul><ul><li>The company has approximately 10,000 accounts. An exhaustive enumeration is impossible. </li></ul><ul><li>Therefore, the auditor randomly samples 100 of the accounts. This file contains the observed data. </li></ul><ul><li>What can the auditor conclude from this sample? </li></ul>
  29. 29. Random Sample
  30. 30. Sources of Estimation Error <ul><li>There are two basic sources of errors that can occur when we sample randomly from a population: </li></ul><ul><ul><li>Sampling error results from “unlucky” samples. </li></ul></ul><ul><ul><li>Nonsampling errors, which are quite different, can occur for a variety of reasons. </li></ul></ul><ul><ul><ul><li>Nonresponse bias occurs when a portion of the sample fails to respond to the survey. </li></ul></ul></ul><ul><ul><ul><li>Nontruthful responses are a particular problem when respondents are asked sensitive questions. One solution is to use a randomized response technique that poses two questions: one sensitive, one innocuous. </li></ul></ul></ul><ul><ul><ul><li>Measurement error occurs when the responses to the questions do not reflect what the investigator had in mind. </li></ul></ul></ul>
  31. 31. Sampling Distribution of the Sample Mean <ul><li>We typically estimate the population mean by the sample mean of the randomly chosen sample. </li></ul><ul><li>The sample mean is called a point estimate of the population mean. </li></ul><ul><li>In general a point estimate of any population parameter is a single-value estimate of that parameter, based on observed sample data. </li></ul><ul><li>The sampling error is the difference between the observed sample mean and the true population mean. </li></ul>
  32. 32. Sampling Distribution of the Sample Mean <ul><li>A negative sampling error means an underestimate of the population mean. </li></ul><ul><li>The standard deviation of the observed sample mean is called the standard error of the mean . </li></ul><ul><li>The sample mean is an unbiased estimate of the population mean. </li></ul>
  33. 33. Solution <ul><li>The receivables for the 100 sampled accounts appear in column E. </li></ul><ul><li>We calculate the sample mean and the sample standard deviation. Then we calculate the (approximate) standard error of the mean with the formula =Sstdev/SQRT(SampSize) in cell B9. </li></ul>
  34. 34. Interpretation <ul><li>The auditor should interpret these values as follows: </li></ul><ul><ul><li>The sample mean can be used to estimate the unknown population mean. It provides a best guess for the average of the receivables for the 10,000 accounts. </li></ul></ul><ul><ul><li>The standard error provides a measure of the accuracy of this estimate. </li></ul></ul><ul><li>The auditor can be approximately 95% confident that the mean of all 10,000 accounts is within the interval $279 plus or minus $84, that is, between $195 and $363. </li></ul>
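The calculation behind this interpretation (sample mean, standard error as the sample standard deviation divided by the square root of the sample size, and an interval of roughly two standard errors on either side) can be sketched as below. The 100 receivable values are made up for illustration and are not the AUDIT.XLS data; `mean_with_interval` is a hypothetical helper name.

```python
import math
import random
import statistics

def mean_with_interval(values):
    """Return (sample mean, standard error, approximate 95% interval)."""
    n = len(values)
    mean = statistics.mean(values)
    std_err = statistics.stdev(values) / math.sqrt(n)  # like =Sstdev/SQRT(SampSize)
    return mean, std_err, (mean - 2 * std_err, mean + 2 * std_err)

# Hypothetical receivables for 100 sampled accounts.
rng = random.Random(6)
receivables = [round(rng.uniform(50, 600), 2) for _ in range(100)]

mean, std_err, (low, high) = mean_with_interval(receivables)
print(f"sample mean:    {mean:.2f}")
print(f"standard error: {std_err:.2f}")
print(f"approx. 95% interval: {low:.2f} to {high:.2f}")
```

The multiplier 2 is the usual rounded stand-in for 1.96; it matches the "plus or minus two standard errors" reading of the slide.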
  35. 35. An Introduction to Estimation
  36. 36. Background Information <ul><li>Suppose you have the opportunity to play a game with a “wheel of fortune”. When you spin a large wheel, it is equally likely to stop in any position. </li></ul><ul><li>Depending on where it stops, you win anywhere from $0 to $1000. </li></ul><ul><li>Let’s suppose your winnings are actually based on not one, but n spins of the wheel. </li></ul><ul><li>If n = 2, your winnings are based on the average of two spins. How does the distribution of your winnings depend on n? </li></ul>
  37. 37. Random Sampling? <ul><li>What does this experiment have to do with random sampling? </li></ul><ul><li>Here, the population is the set of all outcomes we could obtain from a single spin of the wheel; that is, all dollar values from $0 to $1000. Each spin results in one randomly sampled dollar value from the population. </li></ul><ul><li>Furthermore, because we have assumed that the wheel is equally likely to land in any position, all possible values in the continuum from $0 to $1000 have the same chance of occurring. </li></ul>
  38. 38. Random Sampling? <ul><li>The resulting population distribution is called the uniform distribution on the interval from $0 to $1000. </li></ul><ul><li>It can be shown that the mean and standard deviation are $500 and $289, respectively. </li></ul>
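The two values quoted above follow from the standard formulas for a uniform distribution on an interval from a to b. With a = 0 and b = 1000 (the symbols a and b are introduced here only for this derivation):

```latex
\mu = \frac{a + b}{2} = \frac{0 + 1000}{2} = 500,
\qquad
\sigma = \frac{b - a}{\sqrt{12}} = \frac{1000}{\sqrt{12}} \approx 288.7 \approx 289 .
```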
  39. 39. SPIN1.XLS <ul><li>In order to analyze the distribution of winnings based on the average of n spins we need to do a sequence of simulations for n =1, n =2, n=3, n =6 and n =10. </li></ul><ul><li>This spreadsheet contains the simulation for n =1. The other simulations can be found in the following spreadsheets, SPIN2.XLS , SPIN3.XLS , SPIN6.XLS , and SPIN10.XLS . </li></ul><ul><li>For each simulation we consider 1000 replications of an experiment. </li></ul>
  40. 40. Simulations <ul><li>The experiment simulates n spins of the wheel and calculates the average - that is, the winnings - from the n spins. </li></ul><ul><li>Based on these 1000 replications, we can then calculate the average winnings, the standard deviation of winnings, and a histogram of winnings for each n . These will show clearly how the distribution of winnings depends on n . </li></ul><ul><li>The following slide shows the results for n =1. Here, there is no averaging. </li></ul>
  41. 42. Simulations -- continued <ul><li>To replicate the experiment 1000 times and collect statistics, we proceed as follows. </li></ul><ul><ul><li>Random outcomes. To generate outcomes uniformly distributed between $0 and $1000 we enter the formula =$B$3+RAND()*($B$4-$B$3) in cell B11 and copy it down column B. The effect of this formula is to generate a random number between 0 and 1 and multiply it by $1000. </li></ul></ul><ul><ul><li>Summary measures. Calculate the average and standard deviation of the 1000 winnings in column B with the AVERAGE and STDEV functions. These values appear in cells E4 and E5. </li></ul></ul>
  42. 43. Simulations -- continued <ul><ul><li>Frequency table and histogram. Use the StatPro histogram procedure to create a histogram of the values in column B. </li></ul></ul><ul><ul><li>Note the following from the chart and graph for n = 1: </li></ul></ul><ul><ul><ul><li>The sample mean of the winnings (cell E4) is very close to the population mean: $500. </li></ul></ul></ul><ul><ul><ul><li>The standard deviation of the winnings (cell E5) is very close to the population standard deviation: $289. </li></ul></ul></ul><ul><ul><ul><li>The histogram is nearly flat. </li></ul></ul></ul><ul><ul><li>These results should come as no surprise. With no averaging taking place, the winnings are simply individual draws from the flat population distribution. </li></ul></ul>
  43. 44. Simulations -- continued <ul><li>But what happens when n > 1? </li></ul><ul><li>The following slide contains the chart and graph of the n = 2 simulation. </li></ul><ul><li>To do this we formed a second column of outcomes in column C corresponding to a second spin in each experiment. We average the values in columns B and C to obtain each of the winnings in column D. </li></ul><ul><li>The average winnings is very close to $500, but the standard deviation is much lower and the histogram is no longer flat. </li></ul>
  44. 46. Simulations -- continued <ul><li>The histogram is now triangular - symmetric, but not yet bell shaped. </li></ul><ul><li>To develop similar simulations for n = 3, n = 6, n = 10, or any other n, we insert additional outcome columns and make sure that the AVERAGE formula in the Winnings column averages all n outcomes to its left. </li></ul><ul><li>The resulting histograms clearly show two effects of increasing n: </li></ul><ul><ul><li>the histogram becomes more bell shaped </li></ul></ul><ul><ul><li>there is less variability. </li></ul></ul>
  45. 47. Histogram for Three Spins
  46. 48. Histogram for Six Spins
  47. 49. Histogram for Ten Spins
  48. 50. Central Limit Theorem <ul><li>The mean stays right at $500. </li></ul><ul><li>This behavior is exactly as the central limit theorem predicts. </li></ul><ul><ul><li>For any population distribution with mean mu and standard deviation sigma, the sampling distribution of the sample mean is approximately normal with mean mu and standard deviation sigma / sqrt(n), and the approximation improves as n increases. </li></ul></ul><ul><li>In fact, because the population distribution is symmetric in this example - it’s flat - we see the effect of the theorem for n much less than 30; it is already evident for n as low as 6. </li></ul>
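The wheel-of-fortune simulations can be reproduced approximately with a short Python sketch. It is not the SPIN workbooks themselves: the function name `simulate_winnings` is hypothetical, and the random numbers will differ from any spreadsheet run. The last column printed is the standard deviation sigma / sqrt(n) that the central limit theorem predicts, with sigma = 1000 / sqrt(12).

```python
import math
import random
import statistics

def simulate_winnings(n, replications=1000, low=0.0, high=1000.0, rng=None):
    """Each replication averages n uniform spins between low and high."""
    rng = rng or random.Random()
    winnings = []
    for _ in range(replications):
        spins = [low + rng.random() * (high - low) for _ in range(n)]
        winnings.append(statistics.mean(spins))
    return winnings

rng = random.Random(7)
sigma = 1000 / math.sqrt(12)  # population standard deviation, about 289

print(" n   average   stdev   sigma/sqrt(n)")
for n in (1, 2, 3, 6, 10):
    w = simulate_winnings(n, 1000, rng=rng)
    print(f"{n:2d}  {statistics.mean(w):8.1f}  {statistics.stdev(w):6.1f}"
          f"  {sigma / math.sqrt(n):8.1f}")
```

The averages should stay near 500 for every n, while the simulated standard deviations should shrink roughly in line with the last column.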
  49. 51. An Introduction to Estimation
  50. 52. Background Information <ul><li>A marketing researcher has been hired by a videocassette rental company to estimate the average number of videocassettes rented annually by households in a particular metropolitan area. </li></ul><ul><li>The researcher decides to determine the sample size that makes the maximum probable absolute error approximately equal to 10. </li></ul><ul><li>Discuss how she should proceed. </li></ul>
  51. 53. Sample Size Determination <ul><li>The determination of sample size is usually driven by sampling error considerations. </li></ul><ul><li>The usual procedure is to select an acceptable sampling error B, called the maximum probable absolute error, and then find the required sample size n by using the equation n = 4 sigma^2 / B^2 (equivalently, B = 2 sigma / sqrt(n)), where sigma is the population standard deviation. </li></ul><ul><li>The implication is that if we randomly sample this many members from the population, then there is about a 95% chance that the resulting sampling error will be no greater than B in magnitude. </li></ul>
  52. 54. SAMPSIZE.XLS <ul><li>This file contains the data needed to solve the problem. </li></ul><ul><li>The researcher has chosen the maximum probable absolute error criterion, with B = 10 as the value she is willing to tolerate. </li></ul><ul><li>Therefore, she should use the maximum probable absolute error equation. </li></ul>
  53. 55. Solution -- continued <ul><li>To use this equation she must estimate a value of sigma, the population standard deviation. </li></ul><ul><li>Based on her knowledge of the industry and available historical data, she uses a best guess of sigma = 50. </li></ul><ul><li>She then uses the values from C7 and C8 to find the required sample size in C10 with the formula =4*PopStDev^2/MaxAbsErr^2 </li></ul><ul><li>Finally, she takes a sample of size 100 and observes the sample values shown in column F. Based on this sample, we calculate summary measures in the usual way in the range C13:C16. </li></ul>
  54. 56. Sample Size Determination
  55. 57. Results <ul><li>The absolute error in cell C16 is 2 times as great as the standard error in cell C15. </li></ul><ul><li>It is slightly higher than the maximum absolute error she specified in cell C8 because she observed a larger standard deviation than she had guessed. </li></ul><ul><li>In other words, the fact that there is evidently more variation in the population than she thought makes her sample mean based on 100 households slightly less accurate than she intended. </li></ul>
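The sample-size arithmetic in this example can be checked with a short sketch. The function names are hypothetical, rounding the result up with `math.ceil` is a choice made here rather than something stated in the slides, and the observed standard deviation of 55 in the second part is a made-up value used only to show how a larger-than-guessed standard deviation inflates the realized error.

```python
import math

def required_sample_size(sigma_guess, max_abs_error):
    """n = 4 * sigma^2 / B^2, rounded up to a whole number of households."""
    return math.ceil(4 * sigma_guess ** 2 / max_abs_error ** 2)

def realized_abs_error(sample_stdev, n):
    """Maximum probable absolute error: 2 standard errors of the mean."""
    return 2 * sample_stdev / math.sqrt(n)

n = required_sample_size(sigma_guess=50, max_abs_error=10)
print("required sample size:", n)  # 100

# Hypothetical observed sample standard deviation, larger than the guess of 50.
print("realized absolute error:", realized_abs_error(sample_stdev=55, n=n))  # 11.0
```

With the guess sigma = 50 and B = 10 the formula gives exactly 100, matching the sample size used in the example.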