Lecture 4 Applied Econometrics and Economic Modeling

Published in: Education

Transcript

  • 1. Methods for Selecting Random Samples
  • 2. RANDSAMP.XLS
    • This file contains data about the annual incomes of 40 families.
    • We want to choose a simple random sample of size 10 from this frame.
    • How can this be done?
    • And how do summary statistics of the chosen families compare to the corresponding summary statistics of the population?
  • 3. Data
  • 4. Sampling Terminology
    • In any sampling problem there is a relevant population, the set of all members about which the study intends to make inferences.
    • Before we select a sample from a given population, we typically need a list of all members of the population. This list is called the frame, and the potential sample members are called sampling units.
    • There are two types of samples: probability samples and judgmental samples.
  • 5. Sampling Terminology -- continued
    • A probability sample is a sample in which the sampling units are chosen from the population by means of a random mechanism such as a random number table.
    • No formal random mechanism is used to select a judgmental sample; in this case the sampling units are chosen according to the sampler’s judgment.
    • The simplest type of sampling scheme is appropriately called simple random sampling.
  • 6. Solution
    • The idea is very simple. We first generate a column of random numbers in column C. Then we sort the rows according to the random numbers and choose the first 10 families in the sorted rows.
    • The following procedure produces the results.
      • Random numbers. Enter the formula =RAND() in cell C10 and copy it down column C.
      • Replace with values. To enable sorting we must “freeze” the random numbers - that is, replace their formulas with values. To do this, select the range C10:C49, use Edit/Copy, and then use Edit/Paste Special with the Values option.
  • 7. Solution -- continued
      • Copy to a new range. Copy the range A10:C49 to the range E10:G49.
      • Sort. Select the range E10:G49 and use the Data/Sort menu item. Sort according to the Random # column in ascending order. Then the 10 families with the 10 smallest random numbers are the ones in the sample.
      • Means. Use the AVERAGE, MEDIAN and STDEV functions in row 6 to calculate summary statistics of the first 10 incomes in column F.
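The sort-on-random-numbers procedure above can also be sketched outside the spreadsheet. The Python below is only an illustration under stated assumptions: the incomes are made-up placeholders (not the values in RANDSAMP.XLS), and `simple_random_sample` is a hypothetical helper name.

```python
import random
import statistics

def simple_random_sample(frame, sample_size, rng):
    """Attach a random key to every unit, sort on the key, keep the first rows."""
    keyed = [(rng.random(), unit) for unit in frame]  # like =RAND() in column C
    keyed.sort(key=lambda pair: pair[0])              # like Data/Sort, ascending
    return [unit for _, unit in keyed[:sample_size]]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical incomes for 40 families; NOT the data in RANDSAMP.XLS.
incomes = [rng.randrange(20000, 90000, 100) for _ in range(40)]

sample = simple_random_sample(incomes, 10, rng)
print("population mean:", statistics.mean(incomes))
print("sample mean:    ", statistics.mean(sample))
print("sample median:  ", statistics.median(sample))
print("sample stdev:   ", statistics.stdev(sample))
```

Because every family receives an independent random key, each set of 10 families is equally likely to end up in the first 10 sorted rows, which is what makes this a simple random sample.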
  • 8. Results
  • 9. More Random Samples Automatically
    • If we would like more random samples of size 10, we would need to repeat the process for each one.
    • To save you the trouble, we have set up a macro to automate the process. See the Automated sheet of the RANDSAMP.XLS file. By clicking on the button we get a different random sample.
  • 10. Example 8.2 Methods for Selecting Random Samples
  • 11. RECEIVE.XLS
    • This file contains 280 accounts receivable for the Spring Mills Company. There are three variables:
      • Size: customer size (small, medium, large), depending on its volume of business with Spring Mills
      • Days: number of days since the customer was billed
      • Amount: amount of the bill
    • Generate 50 random samples of size 15 each from the small customers only, calculate the average amount owed in each random sample, and construct a histogram of these 50 averages.
  • 12. Generated Random Sample
  • 13. Solution
    • To select small accounts only, insert a blank row after account 150 (the last small account).
    • Then, with the cursor anywhere in the small account data set, use the StatPro/Statistical Inference/Generate Random Samples menu item, enter 50 and 15 as the number of samples and the sample size, and put the results in a new sheet.
    • To find the amounts owed for the sampled accounts, enter the formula =VLOOKUP(B3,Data!Data,4) in cell B21 and copy it to the range B21:AY35.
  • 14. Solution -- continued
    • Then calculate the averages in row 37 with the AVERAGE function, and transpose this row of averages to a column of averages in BA4:BA53 by entering the formula =TRANSPOSE(B37:AY37) and pressing Ctrl-Shift-Enter.
    • Use StatPro’s histogram procedure to create a histogram of these averages. Each run will look different because different random samples are selected.
  • 15. Solution -- continued
    • The histogram indicates the variability of sample means we might obtain by selecting many different random samples of size 15 from this population of small customer accounts.
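The same experiment (50 random samples of size 15, one average per sample) can be sketched in Python. The bill amounts here are hypothetical placeholders rather than the RECEIVE.XLS data, and the text histogram is only a rough stand-in for the StatPro chart.

```python
import random
import statistics

rng = random.Random(1)
# Hypothetical bill amounts for the 150 small accounts; NOT the RECEIVE.XLS data.
small_amounts = [round(rng.uniform(100, 500), 2) for _ in range(150)]

NUM_SAMPLES, SAMPLE_SIZE = 50, 15
sample_means = [
    statistics.mean(rng.sample(small_amounts, SAMPLE_SIZE))  # without replacement
    for _ in range(NUM_SAMPLES)
]

# Crude text histogram of the 50 sample means, in bins of width 20.
bins = {}
for m in sample_means:
    low = int(m // 20) * 20
    bins[low] = bins.get(low, 0) + 1
for low in sorted(bins):
    print(f"{low:4d}-{low + 20:<4d} {'*' * bins[low]}")
```

Changing the seed gives a different set of 50 averages, and so a different histogram, which is the variability the slide describes.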
  • 16. Example 8.3 Methods for Selecting Random Samples
  • 17. STRATIFIED.XLS
    • This file contains a frame of all 1000 people in the city of Smalltown who have Sears credit cards.
    • Sears is interested in estimating the average number of other credit cards these people own, as well as other information about their use of credit.
    • The company decides to stratify these customers by age, select a stratified sample of size 100 with proportional sample sizes, and then contact these 100 people by phone.
    • How might Sears proceed?
  • 18. Systematic Sampling
    • A systematic sample provides a convenient way to choose the sample.
    • It works as follows:
      • First, we calculate the sampling interval as the population size divided by the sample size (220 in this illustration).
      • Next, we use a random mechanism to choose a number between 1 and 220 (say, 131).
      • Then we choose the 131st name, the 351st name, the 571st name, and so on. The result is a systematic sample of size n = 250.
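A minimal Python sketch of these steps follows. The slide does not state the population size; 55,000 is an assumption, chosen because it is the value implied by an interval of 220 and a sample of 250. `systematic_sample` is a hypothetical helper name.

```python
import random

def systematic_sample(population_size, sample_size, rng):
    """Return the 1-based list positions chosen by a systematic sample."""
    interval = population_size // sample_size  # sampling interval
    start = rng.randint(1, interval)           # random start in 1..interval
    return [start + k * interval for k in range(sample_size)]

# 55,000 is an assumed population size (220 * 250), not a figure from the slide.
positions = systematic_sample(55_000, 250, random.Random(0))
print(positions[:3], "...", positions[-1])
```

With a random start of 131 the first three positions would be 131, 351, and 571, as in the example above.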
  • 19. Stratified Sampling
    • Suppose we can identify various subpopulations within the total population. We call these subpopulations strata.
    • It makes sense to select a simple random sample from each stratum instead of from the entire population. This is called stratified sampling.
    • This method is particularly useful when there is considerable variation between the various strata but relatively little variation within a given stratum.
  • 20. Stratified Sampling -- continued
    • To obtain a stratified random sample we must choose a total sample size n, and we must choose a sample size n_i for each stratum i.
    • There are many ways to choose these numbers, but the most popular method is proportional sample sizes.
    • The advantage of proportional sample sizes is that they are very easy to determine. The disadvantage is that they ignore differences in variability among the strata.
  • 21. Solution
    • First Sears must decide exactly how to stratify by age.
    • Their reasoning is that different age groups probably have different attitudes and behavior regarding credit.
    • After preliminary investigation they decide to have three age categories: 18-30, 31-62, and 63-80.
    • The spreadsheet is set up as follows:
      • the total sample size is in cell C3
      • the definitions of the strata are in rows 6-8
      • the customer data are in the range A11:B1010
  • 22. Stratified Sample
  • 23. Solution -- continued
    • To see what age category each customer falls in we enter the formula =IF(B11<=$D$6,1,IF(B11<=$D$7,2,3)) in cell C11 and then copy it down column C.
    • Next, it is useful to “unstack” the data into three groups, one for each age category.
      • It is easy to unstack the data in columns A-C.
      • With the cursor anywhere in A10:C1010 select StatPro/Data Utilities/Unstack Variables. Select Category as the Code variable, select Cust and Age as the variables to unstack, and accept the default location for the unstacked variables.
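The nested IF formula can be read as a small function. This sketch assumes that cells $D$6 and $D$7 hold 30 and 62, the upper ages of the first two categories; the slides do not show those cells, so treat the defaults as assumptions.

```python
def age_category(age, upper_1=30, upper_2=62):
    """Mirror =IF(B11<=$D$6,1,IF(B11<=$D$7,2,3)) for a single customer."""
    if age <= upper_1:   # assumed value of $D$6
        return 1
    if age <= upper_2:   # assumed value of $D$7
        return 2
    return 3

print([age_category(a) for a in (18, 30, 31, 62, 63, 80)])  # [1, 1, 2, 2, 3, 3]
```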
  • 24. Solution -- continued
    • Once the variables are unstacked we can calculate the counts and sample sizes in F6:G8 with the formulas =COUNT(E11:E142) and =ROUND(TotSampSize*F6/1000,0) .
    • Finally, we copy the data in columns E and F into columns L and M, append a column of random numbers, sort on the random number column, and choose the first 13 customers (or however many the stratum requires).
    • The file shows the calculations for the other categories.
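The proportional sample sizes and the per-stratum draw can be sketched in Python as follows. The count of 132 for the first stratum comes from the range E11:E142; the counts 766 and 102 are hypothetical fillers chosen only so that the three strata total 1000. The function names are illustrative.

```python
import random

def proportional_sizes(stratum_counts, total_sample_size):
    """Sample size per stratum, proportional to its share of the frame."""
    population = sum(stratum_counts.values())
    return {
        stratum: round(total_sample_size * count / population)
        for stratum, count in stratum_counts.items()
    }

def stratified_sample(strata, total_sample_size, rng):
    """Draw a simple random sample of the proportional size from each stratum."""
    sizes = proportional_sizes({s: len(m) for s, m in strata.items()}, total_sample_size)
    return {s: rng.sample(members, sizes[s]) for s, members in strata.items()}

# 132 matches rows E11:E142; 766 and 102 are hypothetical counts.
counts = {1: 132, 2: 766, 3: 102}
print(proportional_sizes(counts, 100))  # {1: 13, 2: 77, 3: 10}

strata = {s: [f"cust{s}-{i}" for i in range(c)] for s, c in counts.items()}
sample = stratified_sample(strata, 100, random.Random(0))
print({s: len(chosen) for s, chosen in sample.items()})
```

Note that Python's `round` rounds exact halves to the nearest even integer, whereas Excel's ROUND rounds them away from zero, so the two can differ when a proportional size lands exactly on .5. Rounded sizes also need not sum exactly to the total in general.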
  • 25. Cluster Sampling
    • Suppose a company is interested in various characteristics of households in a particular city. The sampling units are households.
    • We could proceed with the sampling methods already discussed, but another approach would be more convenient.
    • We could divide the city into city blocks as sampling units and then sample all the households in the chosen blocks.
    • In this case the city blocks are called clusters and the sampling is called cluster sampling.
  • 26. Cluster Sampling -- continued
    • The advantage of cluster sampling is sampling convenience (and possibly less cost).
    • It is straightforward to select a cluster sample. The key is to define the sampling units as the clusters, then select a simple random sample of clusters. Then sample all the population members in each selected cluster.
    • When all sampling units within each cluster are taken, it is called a single-stage sampling scheme.
    • Real applications are often more complex and result in multistage sampling schemes.
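A single-stage cluster sample can be sketched as below. The city of 20 blocks with 5 households each is entirely hypothetical, and `cluster_sample` is an illustrative name.

```python
import random

def cluster_sample(clusters, num_clusters, rng):
    """Pick a simple random sample of clusters and keep every member of each."""
    chosen = rng.sample(sorted(clusters), num_clusters)
    return {block: clusters[block] for block in chosen}

# Hypothetical city: 20 blocks with 5 households each.
city = {
    f"block{b:02d}": [f"b{b:02d}-h{h}" for h in range(1, 6)]
    for b in range(1, 21)
}
sample = cluster_sample(city, 4, random.Random(0))
households = [h for block in sample.values() for h in block]
print(len(sample), "blocks,", len(households), "households")  # 4 blocks, 20 households
```

The randomness applies only to the choice of blocks; once a block is chosen, all of its households are included.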
  • 27. Example 8.4 An Introduction to Estimation
  • 28. AUDIT.XLS
    • An internal auditor for a furniture retailer wants to estimate the average of all accounts receivable taken over the population of all customer accounts.
    • The company has approximately 10,000 accounts. An exhaustive enumeration is impossible.
    • Therefore, the auditor randomly samples 100 of the accounts. This file contains the observed data.
    • What can the auditor conclude from this sample?
  • 29. Random Sample
  • 30. Sources of Estimation Error
    • There are two basic sources of errors that can occur when we sample randomly from a population:
      • Sampling error results from “unlucky” samples.
      • Nonsampling errors, which are quite different, can occur for a variety of reasons.
        • Nonresponse bias occurs when a portion of the sample fails to respond to the survey.
        • Nontruthful responses are a particular problem when respondents are asked sensitive questions. One solution is a randomized response technique that poses two questions: one sensitive, one innocuous.
        • Measurement error occurs when the responses to the questions do not reflect what the investigator had in mind.
  • 31. Sampling Distribution of the Sample Mean
    • We typically estimate the population mean by the sample mean of the randomly chosen sample.
    • The sample mean is called a point estimate of the population mean.
    • In general a point estimate of any population parameter is a single-value estimate of that parameter, based on observed sample data.
    • The sampling error is the difference between the observed sample mean and the true population mean.
  • 32. Sampling Distribution of the Sample Mean
    • A negative sampling error means an underestimate of the population mean.
    • The standard deviation of the sampling distribution of the sample mean is called the standard error of the mean.
    • The sample mean is an unbiased estimate of the population mean.
  • 33. Solution
    • The receivables for the 100 sampled accounts appear in column E.
    • We calculate the sample mean and the sample standard deviation. Then we calculate the (approximate) standard error of the mean with the formula =Sstdev/SQRT(SampSize) in cell B9.
  • 34. Interpretation
    • The auditor should interpret these values as follows:
      • The sample mean can be used to estimate the unknown population mean. It provides a best guess for the average of the receivables for the 10,000 accounts.
      • The standard error provides a measure of accuracy.
    • The auditor can be about 95% confident that the mean of all 10,000 accounts lies within $279 plus or minus $84, that is, between $195 and $363.
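The standard error formula =Sstdev/SQRT(SampSize) and the resulting interval can be sketched in Python. This assumes the plus-or-minus figure is roughly two standard errors, the usual approximation for 95%; the four values used below are a toy data set, not the AUDIT.XLS sample.

```python
import math
import statistics

def mean_with_interval(values, multiplier=2.0):
    """Return the sample mean, its approximate standard error, and mean +/- multiplier * SE."""
    n = len(values)
    mean = statistics.mean(values)
    std_error = statistics.stdev(values) / math.sqrt(n)  # =Sstdev/SQRT(SampSize)
    half_width = multiplier * std_error
    return mean, std_error, (mean - half_width, mean + half_width)

# Toy data, NOT the 100 sampled receivables.
mean, std_error, interval = mean_with_interval([100, 200, 300, 400])
print(mean, round(std_error, 2), interval)
```

Under the same assumption, the slide's half-width of $84 would correspond to a standard error of about $42.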
  • 35. An Introduction to Estimation
  • 36. Background Information
    • Suppose you have the opportunity to play a game with a “wheel of fortune”. When you spin a large wheel, it is equally likely to stop in any position.
    • Depending on where it stops, you win anywhere from $0 to $1000.
    • Let’s suppose your winnings are actually based on not one, but n spins of the wheel.
    • If n = 2, your winnings are based on the average of two spins. How does the distribution of your winnings depend on n?
  • 37. Random Sampling?
    • What does this experiment have to do with random sampling?
    • Here, the population is the set of all outcomes we could obtain from a single spin of the wheel; that is, all dollar values from $0 to $1000. Each spin results in one randomly sampled dollar value from the population.
    • Furthermore, because we have assumed that the wheel is equally likely to land in any position, all possible values in the continuum from $0 to $1000 have the same chance of occurring.
  • 38. Random Sampling?
    • The resulting population distribution is called the uniform distribution on the interval from $0 to $1000.
    • It can be shown that the mean and standard deviation are $500 and $289, respectively.
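For reference, these two values follow from the standard formulas for a uniform distribution on an interval from a to b, here with a = 0 and b = 1000:

```latex
\mu = \frac{a + b}{2} = \frac{0 + 1000}{2} = 500,
\qquad
\sigma = \frac{b - a}{\sqrt{12}} = \frac{1000}{\sqrt{12}} \approx 288.7
```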
  • 39. SPIN1.XLS
    • In order to analyze the distribution of winnings based on the average of n spins, we need to do a sequence of simulations for n = 1, n = 2, n = 3, n = 6, and n = 10.
    • This spreadsheet contains the simulation for n = 1. The other simulations can be found in the spreadsheets SPIN2.XLS, SPIN3.XLS, SPIN6.XLS, and SPIN10.XLS.
    • For each simulation we consider 1000 replications of an experiment.
  • 40. Simulations
    • The experiment simulates n spins of the wheel and calculates the average - that is, the winnings - from the n spins.
    • Based on these 1000 replications, we can then calculate the average winnings, the standard deviation of winnings, and a histogram of winnings for each n. These will show clearly how the distribution of winnings depends on n.
    • The following slide shows the results for n = 1. Here, there is no averaging.
  • 41. [Chart: simulation results for n = 1]
  • 42. Simulations -- continued
    • To replicate the experiment 1000 times and collect statistics, we proceed as follows.
      • Random outcomes. To generate outcomes uniformly distributed between $0 and $1000, we enter the formula =$B$3+RAND()*($B$4-$B$3) in cell B11 and copy it down column B. The effect of this formula is to generate a random number between 0 and 1 and multiply it by $1000.
      • Summary measures. Calculate the average and standard deviation of the 1000 winnings in column B with the AVERAGE and STDEV functions. These values appear in cells E4 and E5.
  • 43. Simulations -- continued
      • Frequency table and histogram. Use the StatPro histogram procedure to create a histogram of the values in column B.
      • Note the following from the chart and graph from spin 1:
        • The sample mean of the winnings (E4) is very close to the population mean: $500.
        • The standard deviation of the winnings (cell E5) is very close to the population standard deviation: $289.
        • The histogram is nearly flat.
      • None of this should come as a surprise: with n = 1 no averaging takes place, so the simulated winnings simply reproduce the flat population distribution.
  • 44. Simulations -- continued
    • But what happens when n > 1?
    • The following slide contains the chart and graph of the n = 2 simulation.
    • To do this we formed a second column of outcomes in column C corresponding to a second spin in each experiment. We average the values in column B and C to obtain each of the winnings in column D.
    • The average winnings is very close to $500, but the standard deviation is much lower and the histogram is no longer flat.
  • 45. [Chart: simulation results for n = 2]
  • 46. Simulations -- continued
    • The histogram is now triangular shaped - symmetric, but not yet bell shaped.
    • To develop similar simulations for n = 3, n = 6, n = 10, or any other n, we insert additional outcome columns and make sure that the AVERAGE formula in the Winnings column averages all n outcomes to its left.
    • The resulting histograms clearly show two effects of increasing n:
      • the histogram becomes more bell shaped
      • there is less variability.
  • 47. Histogram for Three Spins
  • 48. Histogram for Six Spins
  • 49. Histogram for Ten Spins
  • 50. Central Limit Theorem
    • The mean stays right at $500.
    • This behavior is exactly as the central limit theorem predicts.
      • For any population distribution with mean mu and standard deviation sigma, the sampling distribution of the sample mean is approximately normal with mean mu and standard deviation sigma/sqrt(n), and the approximation improves as n increases.
    • In fact, because the population distribution is symmetric in this example - it’s flat - we see the effect of the theorem for n much less than 30; it is already evident for n as low as 6.
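The wheel-of-fortune simulations can be approximated in a few lines of Python. This is a sketch rather than a copy of the SPIN spreadsheets: `simulate_winnings` is a hypothetical helper, and the seed is fixed only to make the run repeatable.

```python
import random
import statistics

def simulate_winnings(n_spins, replications=1000, low=0.0, high=1000.0, seed=0):
    """Average of n_spins uniform spins, repeated `replications` times."""
    rng = random.Random(seed)
    return [
        statistics.mean([low + rng.random() * (high - low) for _ in range(n_spins)])
        for _ in range(replications)
    ]

sigma = 1000 / 12 ** 0.5  # population standard deviation, about 289
for n in (1, 2, 3, 6, 10):
    winnings = simulate_winnings(n)
    print(
        f"n={n:2d}  mean={statistics.mean(winnings):6.1f}  "
        f"sd={statistics.stdev(winnings):6.1f}  sigma/sqrt(n)={sigma / n ** 0.5:6.1f}"
    )
```

The simulated mean should stay near $500 for every n, while the simulated standard deviation should track sigma/sqrt(n), shrinking as n grows.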
  • 51. An Introduction to Estimation
  • 52. Background Information
    • A marketing researcher has been hired by a videocassette rental company to estimate the average number of videocassettes rented annually by households in a particular metropolitan area.
    • The researcher decides to determine the sample size that makes the maximum probable absolute error approximately equal to 10.
    • Discuss how she should proceed.
  • 53. Sample Size Determination
    • The determination of sampling size is usually driven by sampling error considerations.
    • The usual procedure is to select an acceptable sampling error B, called the maximum probable absolute error, and then find the sample size n from the equation n = 4*sigma^2/B^2.
    • The implication is that if we randomly sample n members from the population, then there is a 95% chance that the resulting sampling error will be no greater than B in magnitude.
  • 54. SAMPSIZE.XLS
    • This file contains the data needed to solve the problem.
    • The researcher has chosen the maximum probable absolute error criterion, with B = 10 as the value she is willing to tolerate.
    • Therefore, she should use the maximum probable error equation.
  • 55. Solution -- continued
    • To use this equation she must estimate a value of sigma, the population standard deviation.
    • Based on her knowledge of the industry and available historical data, she uses a best guess of sigma = 50.
    • She then uses the values from C7 and C8 to find the required sample size in C10 with the formula =4*PopStDev^2/MaxAbsErr^2
    • Finally, she takes a sample of size 100 and observes the sample values shown in column F. Based on this sample, we calculate summary measures in the usual way in the range C13:C16.
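The spreadsheet formula =4*PopStDev^2/MaxAbsErr^2 can be checked with a short Python function. Rounding up to a whole number is an addition made here, on the assumption that a fractional sample size should never be rounded down; `required_sample_size` is an illustrative name.

```python
import math

def required_sample_size(pop_st_dev, max_abs_err):
    """n = 4 * sigma^2 / B^2, rounded up to a whole number of sampling units."""
    return math.ceil(4 * pop_st_dev ** 2 / max_abs_err ** 2)

print(required_sample_size(50, 10))  # 100
```

With sigma = 50 and B = 10 this reproduces the sample size of 100 used in the example.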
  • 56. Sample Size Determination
  • 57. Results
    • The absolute error in cell C16 is 2 times as great as the standard error in cell C15.
    • It is slightly higher than the maximum absolute error she specified in cell C8 because she observed a larger standard deviation than she had guessed.
    • In other words, the fact that there is evidently more variation in the population than she thought makes her sample mean based on 100 households slightly less accurate than she intended.