
An Introduction to Simulation in the Social Sciences



Published in: Education, Technology

  1. AN INTRODUCTION TO SIMULATION DESIGN IN THE SOCIAL SCIENCES
     By Francis Smart
     Michigan State University
     Agricultural, Food, and Resource Economics
     Measurement and Quantitative
  2. Why Simulate?
     Simulation is a detailed thought experiment. It can be used to:
     1. Confirm theoretical results.
     2. Explore unknown theoretical environments.
     3. Serve as a statistical method for generating estimates.
  3. Why Simulate?
     1. Confirmatory results:
        a. Develop theory
        b. Design simulation
        c. Get results
        d. Sensitivity analysis
     2. Exploratory analysis:
        a. Develop simulation
        b. Get results
        c. Develop theory
        d. Sensitivity analysis
     3. Statistical estimators:
        a. Bootstrap
        b. Markov Chain Monte Carlo (Bayesian)
  4. Some examples
     • Confirmatory:
       1. Econometrician: proposes a new estimator and demonstrates its performance.
       2. Psychometrician: proposes a new item response function and demonstrates its performance.
     • Exploratory:
       1. Econometrician: tests the performance of a consistent estimator on a small sample.
       2. Epidemiologist: explores the effects of different levels of mosquito net usage in a dynamic infection model.
       3. Educational researcher: wonders about the best way to estimate teacher ability when students are non-randomly assigned.
  5. Simulation Stages
     All simulations can be broken down into a series of discrete stages:
     1. Specify Model: choose a theoretical paradigm.
     2. Assign Parameters: survey the literature; calibrate the model; draw from real data.
     3. Generate Data: most simulations generate a data set every time they run.
     4. Calculate/Store Results: know what indicators you need and develop methods for generating those indicators.
     5. Compute Indicators: perform summary statistics from the collection of indicators and the parameters that generated them.
     Repeat.
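The stages above can be sketched as a small R skeleton. This is only an illustration: the linear model, the parameter values, and the function names `generate_data` and `calculate_results` are assumptions made for the example and do not come from the slides.

```r
# Hypothetical skeleton of the simulation stages (all names and values are illustrative).
set.seed(101)

# 1. Specify model: y = b0 + b1 * x + e  (assumed for illustration)
# 2. Assign parameters
params <- list(b0 = 5, b1 = 2, sigma = 1, n = 200)

# 3. Generate data
generate_data <- function(p) {
  x <- rnorm(p$n)
  y <- p$b0 + p$b1 * x + rnorm(p$n, sd = p$sigma)
  data.frame(x = x, y = y)
}

# 4. Calculate and store the result of one run (here, the OLS slope)
calculate_results <- function(d) {
  coef(lm(y ~ x, data = d))[["x"]]
}

# Repeat: run stages 3 and 4 many times
n_reps <- 500
slopes <- replicate(n_reps, calculate_results(generate_data(params)))

# 5. Compute indicators from the collection of stored results
indicators <- c(mean_slope = mean(slopes), sd_slope = sd(slopes))
print(indicators)
```

Keeping each stage in its own function makes it easier to change one stage (for example, the data-generating step) without touching the others.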
  6. 1. Specify Model
     • Identify the underlying model (theoretical paradigm).
       This is usually obvious from the discipline you work in, though it is not uncommon for simulations to be interdisciplinary in nature.
     • Identify the minimum required complexity.
       Generally, the simpler the model with which you can test or demonstrate your theory, the better. The more complexity in your model, the more places there are for uncertainty about what is driving your results.
  7. Choice of Environment: Stata or R*
     1. Most people will have a previously defined preference.
     2. Simple simulations are often easier in Stata because of built-in commands like “simulate”.
     3. Simulations handling multiple agents, multiple data sets, or complex relationships are often easier in R.
     4. Stata is to accounting as R is to Tetris.
     * There are many other programming languages suitable for simulation studies. These are the two that I know well.
  8. 2. Assign Parameters
     • Survey the literature for reasonable model parameters.
     • Estimate reasonable model parameters from available data.
     • For parameters without theoretical backing, construct a reasonable argument for the values chosen.
     • Allow some parameters to vary, either gradually or randomly.
  9. Model Calibration
     • Typically there are some parameters for which no estimates are available.
     • Adjust these parameters to calibrate the model so that it produces believable and desirable outcomes.
     • For instance: in the malaria transmission simulation we varied mosquito speed and malaria resistance rates to achieve a desired infection rate among the general population of 15-30% at steady state.
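A minimal sketch of calibration by grid search follows. It does not reproduce the malaria simulation: the one-line infection model, the parameter names `beta` and `gamma`, and all numeric values are hypothetical stand-ins chosen so the example stays short. Only the 15-30% target band is taken from the slide.

```r
# Toy calibration sketch (hypothetical model and values, not the malaria simulation).
# The infected share p is updated each round as p + beta * p * (1 - p) - gamma * p.
steady_state_rate <- function(beta, gamma = 0.1, p0 = 0.01, rounds = 5000) {
  p <- p0
  for (t in seq_len(rounds)) {
    p <- p + beta * p * (1 - p) - gamma * p
  }
  p
}

# Candidate values for the parameter that has no available estimate
beta_grid <- seq(0.10, 0.20, by = 0.005)
rates <- sapply(beta_grid, steady_state_rate)

# Keep the candidates whose long-run infection rate lands in the 15-30% band
calibrated <- beta_grid[rates >= 0.15 & rates <= 0.30]
print(data.frame(beta = beta_grid, rate = round(rates, 3)))
print(calibrated)
```

In a stochastic simulation the same idea applies, but each candidate value would need several repetitions before its outcome is compared with the target band.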
  10. 3. Generate Data
      • Draw from theoretical distributions.

        Distribution   Stata             R
        Normal         rnormal()         rnorm()
        Uniform        runiform()        runif()
        Poisson        rpoisson()        rpois()
        Bernoulli      rbinomial(1,…)    rbinom(…,1,…)

      • Resample from available data (bootstrapping, for instance).
      • Sort or organize data.
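The three data-generating moves listed above might look like this in R. The sample size and the distribution parameters are arbitrary choices for illustration.

```r
set.seed(42)
n <- 1000

# Draw from theoretical distributions (parameter values are illustrative)
draws <- data.frame(
  normal    = rnorm(n, mean = 0, sd = 1),
  uniform   = runif(n, min = 0, max = 1),
  poisson   = rpois(n, lambda = 3),
  bernoulli = rbinom(n, size = 1, prob = 0.3)
)

# Resample rows with replacement (the basic move behind the bootstrap)
resampled <- draws[sample(nrow(draws), replace = TRUE), ]

# Sort or organize the data
ordered <- draws[order(draws$normal), ]

summary(draws)
```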
  11. Random Seed
      • Most programs are incapable of generating truly random numbers; they generate pseudo-random numbers instead.
      • Often, truly random numbers are undesirable.
      • If the draws were truly random, results could not be duplicated exactly.
      • Setting the seed allows exactly the same ‘random’ variables to be generated on every run, so results do not change.
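A short R illustration of the point about seeds; the seed value itself is arbitrary.

```r
set.seed(2024)
first <- rnorm(5)

set.seed(2024)
second <- rnorm(5)

identical(first, second)   # TRUE: same seed, same draws

# Without resetting the seed, the generator continues its stream,
# so the next five draws differ from the first five.
third <- rnorm(5)
identical(first, third)
```

In Stata the corresponding command is `set seed`.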
  12. Calculate results
      • Know what results are needed to confirm your theory. For example:
        1. The benefit of bednet usage is greater than the cost of bednets.
        2. The estimator is unbiased.
        3. Estimates from one estimator are better than those from another.
      • Know what results are needed to confirm that the simulation is working properly. For example:
        1. Students should only have one teacher per grade.
        2. The skewness of the explanatory variable should be less than that of the dependent variable.
  13. Repeat
      • This may seem like a trivial task, but it is not. Repetition is essential in most simulations. It is generally unconvincing (and often uninformative) to run a simulation only once.
      • Some people do not believe the results of any simulation that is not repeated at least 1000 times.
      • How one repeats a simulation, and how one interprets the results of the collective set of repetitions, are important questions. For example:
        1. Does one count the number of times that a mosquito net is profitable to buy, or report the average return from purchasing mosquito nets?
        2. Does one present the average of an estimator and its standard deviation, or how frequently the true parameter falls within the estimator's confidence interval?
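Both ways of summarizing repetitions in the second example can be computed from the same set of runs. The sketch below assumes a simple linear model with a true slope of 2 and 100 observations per run; these values, and the helper name `one_rep`, are illustrative only.

```r
set.seed(7)
true_slope <- 2
n_reps <- 1000

# One repetition: generate data, estimate, and record two kinds of result
one_rep <- function(n = 100) {
  x <- rnorm(n)
  y <- 5 + true_slope * x + rnorm(n)
  fit <- lm(y ~ x)
  ci <- confint(fit)["x", ]   # 95% confidence interval for the slope
  c(estimate = coef(fit)[["x"]],
    covered  = unname(ci[1] <= true_slope && true_slope <= ci[2]))
}

# Each repetition becomes one row of the results matrix
results <- t(replicate(n_reps, one_rep()))

mean(results[, "estimate"])   # average of the estimator
sd(results[, "estimate"])     # its standard deviation across repetitions
mean(results[, "covered"])    # share of intervals containing the true slope
```

The first two numbers describe the estimator's center and spread; the third describes how well its confidence interval performs. Which one to report depends on the question the simulation is meant to answer.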
  14. Necessary Programming Tools
      • Macros/scalar manipulation
      • Data generating commands
      • For/While loops
      • The ability to store results after commands
  15. Example Simulation
      Stata: Simulate the result of errors correlated with the explanatory variable.

      set more off
      * Turn off the "more" pause in scrolling output (I have it set to permanently off on my computer)
      clear
      * Clear the old data
      set obs 1000
      * Tell Stata you want 1000 observations available to be used for data generation.
      gen x = rnormal()
      * This is some random explanatory variable
  16. Sort x and u
      sort x
      * Now the data is ordered from the smallest x to the largest x
      gen id = _n
      * This will count from 1 to 1000 so that each observation has a unique id
      gen u = rnormal()
      * u is the unobserved error in the model
      sort u
      * Now the data is ordered from the smallest u to the largest u
      gen x2 = .
      * We are going to match up the smallest u with the smallest x.
  17. Force the correlation between x draws and the error to be positive.
      * This will loop from 1 to 1000
      forv i=1/1000 {
        replace x2 = x[`i'] if id[`i']==_n
      }
      drop x
      rename x2 x
      corr x u
      /*
                   |        x        u
      -------------+------------------
                 x |   1.0000
                 u |   0.9980   1.0000
      */
  18. Results
      gen y = 5 + 2*x + u*5
      reg y x

            Source |       SS       df       MS              Number of obs =    1000
      -------------+------------------------------           F(  1,   998) =       .
             Model |  50827.8493     1  50827.8493           Prob > F      =  0.0000
          Residual |  55.8351723   998  .055947066           R-squared     =  0.9989
      -------------+------------------------------           Adj R-squared =  0.9989
             Total |  50883.6844   999  50.9346191           Root MSE      =  .23653

      ------------------------------------------------------------------------------
                 y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
                 x |   7.145123   .0074963   953.15   0.000     7.130412    7.159833
             _cons |   4.858391   .0074869   648.92   0.000     4.843699    4.873083
      ------------------------------------------------------------------------------

      * This shows that when the error is correlated with the explanatory variable, the OLS
      * estimator can be severely biased: the true coefficient on x is 2, but the estimate is about 7.1.
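The size of the bias follows from the data-generating equation y = 5 + 2x + 5u. The derivation below is a standard argument added here as an explanatory sketch. It takes the standard deviations of x and u to be approximately equal, which is reasonable because both are drawn from rnormal(), and it uses the correlation of 0.998 reported by `corr x u`.

```latex
\hat{\beta}_{OLS} \xrightarrow{p} \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)}
  = \frac{\operatorname{Cov}(x,\, 5 + 2x + 5u)}{\operatorname{Var}(x)}
  = 2 + 5\,\frac{\operatorname{Cov}(x, u)}{\operatorname{Var}(x)}
  = 2 + 5\,\rho_{xu}\,\frac{\sigma_u}{\sigma_x}
  \approx 2 + 5\,(0.998)(1) \approx 6.99
```

This is close to the estimated coefficient of 7.145. The remaining gap is consistent with the sample standard deviations of x and u not being exactly equal in a single draw of 1000 observations.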
  19. Same simulation in R
      x = sort(rnorm(1000))
      u = sort(rnorm(1000))
      y = 5 + 2*x + u*5
      summary(lm(y~x))
      # This simulation turns out to be extremely easy in R

      Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
      (Intercept)  4.75818    0.01281   371.5   <2e-16 ***
      x            6.86977    0.01282   535.8   <2e-16 ***
  20. Multi-agent simulations
      • Multi-agent simulations are simulations in which agents with specified command routines interact. Some result of that interaction is subsequently observed and stored for analysis.
      • An example from my work is a recent project with Andrew Dillon in which we simulated an environment populated by both humans and mosquitoes. The human population stayed constant while the mosquito population moved each round. Mosquitoes had a chance of becoming infected with malaria or of infecting humans with malaria. Two hundred days (rounds) were simulated per simulation, and the last thirty were used to calculate the returns from technology choice for the group that decided to use prevention technology at the beginning of the simulation, relative to those who decided against it.
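To give a feel for the structure of such a simulation, here is a deliberately small toy in R. It is not the model from the project described above: the ring of cells, the population sizes, and every probability are hypothetical, and prevention technology is left out entirely. Only the 200 rounds and the use of the last 30 rounds follow the slide.

```r
# Toy multi-agent sketch; every number below is a hypothetical choice.
set.seed(11)
n_cells <- 50; n_humans <- 200; n_mosq <- 300; n_rounds <- 200
p_infect_human  <- 0.3   # infected mosquito bites susceptible human
p_infect_mosq   <- 0.3   # susceptible mosquito bites infected human
p_recover       <- 0.05  # human recovery chance per round
p_mosq_turnover <- 0.1   # mosquito replaced by an uninfected one

human_cell <- sample(n_cells, n_humans, replace = TRUE)  # humans do not move
human_inf  <- runif(n_humans) < 0.05                     # initial infections
mosq_cell  <- sample(n_cells, n_mosq, replace = TRUE)
mosq_inf   <- rep(FALSE, n_mosq)
share_inf  <- numeric(n_rounds)

for (r in seq_len(n_rounds)) {
  # Mosquitoes move one cell left or right on a ring of cells
  step <- sample(c(-1, 1), n_mosq, replace = TRUE)
  mosq_cell <- (mosq_cell - 1 + step) %% n_cells + 1

  # Each mosquito bites one randomly chosen human in its cell, if any
  for (m in seq_len(n_mosq)) {
    here <- which(human_cell == mosq_cell[m])
    if (length(here) == 0) next
    h <- here[sample.int(length(here), 1)]
    if (mosq_inf[m] && !human_inf[h] && runif(1) < p_infect_human) {
      human_inf[h] <- TRUE
    } else if (!mosq_inf[m] && human_inf[h] && runif(1) < p_infect_mosq) {
      mosq_inf[m] <- TRUE
    }
  }

  # Human recovery and mosquito turnover
  human_inf[human_inf & runif(n_humans) < p_recover] <- FALSE
  mosq_inf[runif(n_mosq) < p_mosq_turnover] <- FALSE

  # Store the result of this round for later analysis
  share_inf[r] <- mean(human_inf)
}

# Average human infection share over the last 30 rounds
mean(tail(share_inf, 30))
```

Plotting `share_inf` against the round number is a quick way to see whether a run settles down before the rounds used for the final calculation.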
  21. Multi-agent simulations: Error Checking
      Multi-agent simulations are especially prone to errors. Develop error routines to check for bugs.
      1. If assigning subjects to groups, make sure every subject has only one group and all of the groups have equal numbers of subjects (if balanced).
      2. If generating composite random variables, be sure the resulting random variables have reasonable ranges (probabilities cannot be less than 0 or greater than 1).
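Both checks can be written as assertions that stop the run as soon as something is wrong. The sketch below uses base R's `stopifnot()`; the group sizes and the way the composite variable is built are made up for the example.

```r
set.seed(3)
n_subjects <- 120
n_groups <- 4

# Balanced random assignment: shuffle a vector that holds each group label equally often
assignment <- data.frame(
  subject = seq_len(n_subjects),
  group   = sample(rep(seq_len(n_groups), length.out = n_subjects))
)

# Check 1: every subject appears exactly once, so each has exactly one group
stopifnot(!anyDuplicated(assignment$subject))

# Check 2: all groups are the same size (balanced design)
group_sizes <- table(assignment$group)
stopifnot(length(unique(as.vector(group_sizes))) == 1)

# Composite random variable that is meant to be a probability
latent <- 0.5 * rnorm(n_subjects) + 0.5 * rnorm(n_subjects)
prob <- plogis(latent)   # logistic transform keeps values strictly between 0 and 1

# Check 3: probabilities stay within [0, 1]
stopifnot(all(prob >= 0 & prob <= 1))
```

Running checks like these inside every repetition costs little and catches bugs that a single summary statistic at the end can hide.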
  22. Graphical error checks
      • Generate graphical figures as a means of checking for errors.
      The simulation appears to be converging on a steady state.
  23. Statistical Estimators
      • Bootstrap (case resampling)
        The bootstrap routine takes advantage of the assumption of random sampling. It is often used to estimate the variances of random variables.
      • Markov Chain Monte Carlo (Bayesian Estimation)
        MCMC is a class of algorithms that construct a Markov chain whose equilibrium distribution is the desired distribution. MCMC uses a set of transition rules to move from a specified prior distribution to a distribution reflective of the sample distribution.
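Minimal R sketches of both estimators follow. The data, the choice of the median as the statistic, the normal model with known standard deviation, the N(0, 10^2) prior, and the random-walk Metropolis sampler are all assumptions picked for brevity; Metropolis is only one member of the MCMC family.

```r
set.seed(5)

# --- Bootstrap (case resampling): standard error of a sample median ---
sample_data <- rexp(200, rate = 1)   # stand-in for an observed sample
boot_medians <- replicate(2000, median(sample(sample_data, replace = TRUE)))
boot_se <- sd(boot_medians)

# --- Random-walk Metropolis for the mean mu of normal data ---
# Assumed model: y ~ N(mu, 1), with prior mu ~ N(0, 10^2)
y <- rnorm(50, mean = 1.5, sd = 1)
log_post <- function(mu) {
  sum(dnorm(y, mean = mu, sd = 1, log = TRUE)) + dnorm(mu, 0, 10, log = TRUE)
}

n_iter <- 5000
chain <- numeric(n_iter)
chain[1] <- 0                        # arbitrary starting value
for (i in 2:n_iter) {
  proposal <- chain[i - 1] + rnorm(1, sd = 0.3)
  # Accept with probability min(1, posterior ratio), computed on the log scale
  accept <- log(runif(1)) < log_post(proposal) - log_post(chain[i - 1])
  chain[i] <- if (accept) proposal else chain[i - 1]
}
posterior_draws <- chain[-(1:1000)]  # drop burn-in

c(boot_se = boot_se,
  post_mean = mean(posterior_draws),
  post_sd = sd(posterior_draws))
```

With a prior this diffuse, the posterior mean should sit close to the sample mean of `y`, which gives a simple sanity check on the sampler.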
  24. For Additional Reference
      • For many more examples of simulations in R and Stata go to