An Introduction to Simulation in the Social Sciences

AN INTRODUCTION TO SIMULATION DESIGN IN THE SOCIAL SCIENCES
By Francis Smart

Michigan State University
Agricultural, Food, and Resource Economics
Measurement and Quantitative Methods

www.econometricsbysimulation.com

Why Simulate?
Simulation is a detailed thought experiment:

1. Confirm theoretical results.

2. Explore unknown theoretical environments.

3. Serve as a statistical method for generating estimates.

Why Simulate?
1. Confirmatory results:
a. Develop theory
b. Design simulation
c. Get results
d. Sensitivity analysis

2. Exploratory analysis:
a. Develop simulation
b. Get results
c. Develop theory
d. Sensitivity analysis

3. Statistical estimators:
a. Bootstrap
b. Markov Chain Monte Carlo (Bayesian)

Some examples
• Confirmatory:
1. Econometrician – new estimator, demonstrate performance
2. Psychometrician – new item response function, demonstrate
performance

• Exploratory:
1. Econometrician – test performance of consistent estimator
on small sample
2. Epidemiologist – explore the effects of different levels of
mosquito net usage in a dynamic infection model
3. Educational researcher – explore the best way to
estimate teacher ability when students are non-randomly
assigned

Simulation Stages
All simulations can be broken down into a series of discrete
stages.
1. Specify Model: Choose a theoretical paradigm.
2. Assign Parameters: Survey the literature; calibrate the model; draw from real data.
3. Generate Data: Most simulations generate a data set every time they run.
4. Calculate/Store Indicators: Know what indicators you need and develop methods for generating those indicators.
5. Compute Results: Perform summary statistics from the collection of indicators and the parameters that generated them.

Repeat

1. Specify Model
• Identify underlying model (theoretical paradigm)
This is usually obvious from the discipline you work in,
though it is not uncommon for simulations to be
interdisciplinary in nature.

• Identify minimum required complexity
Generally, the simpler the model with which you can
test or demonstrate your theory, the better. The more
complexity in your model, the more room for
uncertainty about what is driving your results.

Choice of Environment
Stata or R*
1. Most people will have a previously defined preference.

2. Simple simulations are often easier in Stata because of
built-in commands like “simulate”.

3. Simulations handling multiple agents, multiple data sets, or
complex relationships are often easier in R.

4. Stata is to accounting as R is to Tetris.

* There are many other programming languages suitable for
simulation studies. These are the two which I know well.

2. Assign Parameters
• Survey the literature for reasonable model parameters.

• Estimate reasonable model parameters from available
data.

• Make a reasoned argument for any parameter choices
that lack theoretical backing.

• Allow some parameters to vary either gradually or
randomly.
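
As a rough sketch of the last point, the R snippet below builds one parameter set per simulation run, stepping one parameter gradually and drawing another at random. The parameter names (net_usage, infection_prob) and their ranges are hypothetical and chosen only for illustration.

```r
# Hypothetical sketch: two ways to let a parameter vary across simulation runs.
set.seed(42)
n_runs <- 10

# Gradual variation: step net_usage evenly from 0% to 90% (made-up range).
net_usage_grid <- seq(0, 0.9, length.out = n_runs)

# Random variation: draw infection_prob uniformly for each run (made-up range).
infection_prob_draws <- runif(n_runs, min = 0.05, max = 0.25)

# One row per run; every other model parameter would be held fixed.
params <- data.frame(run = 1:n_runs,
                     net_usage = net_usage_grid,
                     infection_prob = infection_prob_draws)
print(params)
```

Each row of params would then be passed to one run of the simulation, so that results can later be related back to the parameters that generated them.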

Model Calibration
• Typically there are parameters for which no estimates are
available.

• Adjust these parameters to calibrate the model so that it produces
believable and desirable outcomes.

• For instance: In the malaria transmission simulation we varied
mosquito speed and malaria resistance rates to achieve a desired
infection rate among the general population of 15-30% at steady
state.
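
A minimal calibration sketch in R follows. It is not the malaria model described above: it assumes a deliberately simple, hypothetical infection model with a transmission rate (beta) and a recovery rate (gamma), and searches a grid of beta values for those that give a long-run infection rate inside the 15-30% target band.

```r
# Toy (hypothetical) infection model, NOT the author's malaria simulation.
# The infected share evolves as:
#   I[t+1] = I[t] + beta * I[t] * (1 - I[t]) - gamma * I[t]
steady_state_rate <- function(beta, gamma = 0.5, n_days = 200, last = 30) {
  infected <- numeric(n_days)
  infected[1] <- 0.01                      # assumed initial infected share
  for (t in 2:n_days) {
    infected[t] <- infected[t - 1] +
      beta * infected[t - 1] * (1 - infected[t - 1]) -
      gamma * infected[t - 1]
  }
  mean(tail(infected, last))               # average over the final days
}

# Try a grid of candidate values for the parameter with no available estimate.
beta_grid <- seq(0.50, 1.00, by = 0.05)
rates <- sapply(beta_grid, steady_state_rate)

# Keep the values that reproduce the target infection rate of 15-30%.
acceptable <- beta_grid[rates >= 0.15 & rates <= 0.30]
print(data.frame(beta = beta_grid, infection_rate = round(rates, 3)))
print(acceptable)
```

In a stochastic model the same idea applies, but each candidate value would need to be run several times and the outcomes averaged before comparing them with the target.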

3. Generate Data
• Draw from theoretical distributions.

Distribution    Stata              R
Normal          rnormal()          rnorm()
Uniform         runiform()         runif()
Poisson         rpoisson()         rpois()
Bernoulli       rbinomial(1,…)     rbinom(…,1,…)

• Resample from available data (bootstrapping, for instance).

• Sort or organize data.
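
The three approaches above might look as follows in R. The distribution parameters and the small "observed" vector are made up for illustration.

```r
set.seed(2024)
n <- 1000

# Draw from theoretical distributions (parameter values are arbitrary).
normal_draws    <- rnorm(n, mean = 0, sd = 1)
uniform_draws   <- runif(n, min = 0, max = 1)
poisson_draws   <- rpois(n, lambda = 3)
bernoulli_draws <- rbinom(n, size = 1, prob = 0.3)  # Bernoulli = binomial with size 1

# Resample from available data: draw cases with replacement.
observed  <- c(2.1, 3.4, 1.9, 5.0, 4.2)             # made-up "real" data
resampled <- sample(observed, size = length(observed), replace = TRUE)

# Sort or organize data.
sorted_draws <- sort(normal_draws)

print(head(sorted_draws))
print(resampled)
```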

Random Seed
• Most programs are incapable of generating truly random
numbers.

• Often, truly random numbers are undesirable.

• If the numbers are truly random, then results cannot be duplicated.

• Setting the seed allows exactly the same ‘random’
variables to be generated again. Thus results do not change
from run to run.
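
A short R illustration of this point (the seed value 101 is arbitrary; in Stata the corresponding command is `set seed`):

```r
# Same seed, same 'random' draws.
set.seed(101)
first_run <- rnorm(5)

set.seed(101)
second_run <- rnorm(5)

print(identical(first_run, second_run))

# Without resetting the seed, the generator simply continues its sequence,
# so the next draws differ from the previous ones.
third_run <- rnorm(5)
print(identical(second_run, third_run))
```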

Calculate results
• Know what results are needed for confirmation of your theory. For
example:
1. Benefit of bednet usage is greater than the cost of bednets
2. The estimator is unbiased.
3. Estimates from one estimator are better than those from another.

• Know what results are needed for confirmation that the simulation is
working properly. For example:
1. Students should only have one teacher per grade.
2. The skewness of the explanatory variable should be less than that of
the dependent variable.

Repeat
• This may seem like a trivial task but it is not. Repetition is essential
in most simulations. It is generally unconvincing (and often
uninformative) to run a simulation only once.

• Some people do not believe results of any simulation that is not
repeated at least 1000 times.

• How one repeats a simulation and how one interprets the results of
the collective set of repetitions are important questions. For
example:
1. Does one count the number of times that a mosquito net is profitable to
buy, or report the average return from purchasing mosquito nets?
2. Does one present the average of an estimator and its standard deviation,
or does one present how frequently the true parameter falls within the
confidence interval of the estimator?
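
The second question can be made concrete with a small R sketch that repeats a simple regression simulation 1000 times and reports both summaries. The data-generating process (true slope of 2, standard normal x and u, 100 observations per repetition) is an assumption chosen for illustration.

```r
set.seed(123)

# One repetition: generate data, estimate, store the indicators we need.
one_rep <- function(n = 100, beta = 2) {
  x <- rnorm(n)
  u <- rnorm(n)                       # error is independent of x here
  y <- 5 + beta * x + u
  fit <- lm(y ~ x)
  ci <- confint(fit)["x", ]           # 95% confidence interval for the slope
  c(estimate = unname(coef(fit)["x"]),
    covered  = unname(ci[1] <= beta && beta <= ci[2]))
}

# Repeat the simulation and collect the indicators (2 rows x 1000 columns).
results <- replicate(1000, one_rep())

# Summary 1: average of the estimator and its standard deviation.
mean_est <- mean(results["estimate", ])
sd_est   <- sd(results["estimate", ])

# Summary 2: how often the true parameter falls inside the interval.
coverage <- mean(results["covered", ])

print(c(mean = mean_est, sd = sd_est, coverage = coverage))
```

Both summaries come from the same stored indicators, so it costs little to compute and report each of them.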

Necessary Programming Tools
• Macros/scalar manipulation

• Data generating commands

• For/While loops

• The ability to store results after commands

Example Simulation:
Stata: Simulate the result of errors correlated with the explanatory variable.

set more off
* Turn off the "more" pause so output scrolls without stopping (I have it set to permanently off on my computer)

clear
* Clear the old data

set obs 1000
* Tell Stata you want 1000 observations available to be used for data
* generation.

gen x = rnormal()
* This is some random explanatory variable

Sort x and u
sort x
* Now the data is ordered from the smallest x to the largest x

gen id = _n
* This will count from 1 to 1000 so that each observation has a unique id

gen u = rnormal()
* u is the unobserved error in the model

sort u
* Now the data is ordered from the smallest u to the largest u

gen x2 = .
* We are going to match up the smallest u with the smallest x.

Force the correlation between x draws and the
error to be positive.
* This will loop from 1 to 1000
forv i=1/1000 {
replace x2 = x[`i'] if id[`i']==_n
}

drop x
rename x2 x

corr x u
/* | x u
-------------+------------------
x | 1.0000
u | 0.9980 1.0000 */

Results
gen y = 5 + 2*x + u*5

reg y x

Source | SS df MS Number of obs = 1000
-------------+------------------------------ F( 1, 998) = .
Model | 50827.8493 1 50827.8493 Prob > F = 0.0000
Residual | 55.8351723 998 .055947066 R-squared = 0.9989
-------------+------------------------------ Adj R-squared = 0.9989
Total | 50883.6844 999 50.9346191 Root MSE = .23653

------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | 7.145123 .0074963 953.15 0.000 7.130412 7.159833
_cons | 4.858391 .0074869 648.92 0.000 4.843699 4.873083
------------------------------------------------------------------------------

* This shows that when the error is correlated with the explanatory variable, the
* OLS estimator can be severely biased (the true slope is 2; the estimate is about 7.1).

Same simulation in R
x = sort(rnorm(1000))
u = sort(rnorm(1000))

y = 5 + 2*x + u*5

summary(lm(y~x))
# This simulation turns out to be extremely easy in R

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.75818 0.01281 371.5 <2e-16 ***
x 6.86977 0.01282 535.8 <2e-16 ***
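
To see where a slope near 7 comes from, note that the OLS slope equals cov(x, y) / var(x), which for this data-generating process is 2 + 5 * cov(x, u) / var(x). The R sketch below checks that identity on one simulated data set; the seed is arbitrary and was not part of the original example.

```r
set.seed(7)  # arbitrary seed, added only for reproducibility

# Sorting both vectors pairs the smallest x with the smallest u,
# which forces a strong positive correlation between them.
x <- sort(rnorm(1000))
u <- sort(rnorm(1000))
y <- 5 + 2 * x + 5 * u

ols_slope <- unname(coef(lm(y ~ x))["x"])

# True slope (2) plus the bias term created by the correlated error.
implied_slope <- 2 + 5 * cov(x, u) / var(x)

print(c(ols = ols_slope, implied = implied_slope, corr_xu = cor(x, u)))
```

Because cov(x, u) / var(x) is close to 1 when both sorted vectors are standard normal draws, the estimated slope lands near 2 + 5 = 7 rather than the true value of 2.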

Multi-agent simulations
• These are simulations in which agents with specified command routines
interact. Some result of that interaction is subsequently observed
and stored for analysis.

• An example from my work is a recent project with Andrew Dillon in
which we simulated an environment populated by both humans and
mosquitos. The human population stayed constant while the
mosquito population moved each round. Mosquitos had the
chance of becoming infected with malaria or infecting humans with
malaria. Two hundred days (rounds) were simulated per simulation
and the last thirty were used to calculate the returns from
technology choice for the group that decided to use prevention
technology at the beginning of the simulation relative to those who
decided against prevention technology.

Multi-agent simulations: Error Checking
Multi-agent simulations are especially prone to errors. Develop error routines to check for bugs.

1. If assigning subjects to groups make sure all of the subjects have
only one group and all of the groups have equal numbers of
subjects (if balanced).

2. If generating composite random variables be sure the resulting
random variables have reasonable ranges (probabilities cannot
be less than 0 or greater than 1).
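
Both checks can be written as assertions that stop the simulation as soon as something is wrong. The sketch below uses R's stopifnot(); the group sizes, variable names, and the logistic link used to build the probability are hypothetical.

```r
set.seed(11)
n_subjects <- 100
n_groups   <- 4

# Balanced random assignment: shuffle a vector holding 25 copies of each group.
assignments <- data.frame(
  subject = 1:n_subjects,
  group   = sample(rep(1:n_groups, each = n_subjects / n_groups))
)

# Check 1: every subject appears exactly once and has a valid group.
stopifnot(anyDuplicated(assignments$subject) == 0)
stopifnot(!anyNA(assignments$group), all(assignments$group %in% 1:n_groups))

# Check 2: all groups have equal numbers of subjects (balanced design).
group_sizes <- as.vector(table(assignments$group))
stopifnot(length(unique(group_sizes)) == 1)

# Check 3: a composite random variable used as a probability stays in [0, 1].
ability   <- rnorm(n_subjects)
noise     <- rnorm(n_subjects)
prob_pass <- plogis(0.5 * ability + noise)  # logistic link keeps values in (0, 1)
stopifnot(all(prob_pass >= 0 & prob_pass <= 1))

print(group_sizes)
```

Placing checks like these directly after the step they verify makes it easier to see which stage of the simulation introduced a bug.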

Graphical error checks
• Generate graphical figures as a means of checking for errors

The simulation appears to be converging on a steady state.

Statistical Estimators
• Bootstrap (case resampling)
The bootstrap routine takes advantage of the assumption of
random sampling: resampling the observed cases with replacement
mimics drawing new samples from the population. It is often used
to estimate the variances of estimators.
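
A minimal case-resampling sketch in R, assuming the statistic of interest is the sample mean and using simulated data as a stand-in for a real sample:

```r
set.seed(55)

# Stand-in for an observed sample (simulated here for illustration).
sample_data <- rnorm(200, mean = 10, sd = 2)

# Resample the cases with replacement many times, recomputing the
# statistic on each bootstrap sample.
n_boot <- 2000
boot_means <- replicate(n_boot, mean(sample(sample_data, replace = TRUE)))

# The spread of the bootstrap statistics estimates the standard error.
boot_se <- sd(boot_means)

# For the mean, an analytic standard error is available for comparison.
analytic_se <- sd(sample_data) / sqrt(length(sample_data))

print(c(bootstrap = boot_se, analytic = analytic_se))
```

The mean is used here only because its standard error has a known formula to compare against; the same loop works for statistics without one, such as a median.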

• Markov Chain Monte Carlo (Bayesian Estimation)
MCMC is a class of algorithms that construct a Markov chain whose
equilibrium distribution is the desired distribution. In Bayesian
estimation, the desired distribution is the posterior, which combines
a specified prior distribution with the information in the sample.
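
As one illustration, the sketch below implements a random-walk Metropolis sampler, a basic member of the MCMC family, for the mean of normally distributed data. The data, the known standard deviation of 1, the N(0, 10^2) prior, the proposal step size, and the burn-in length are all assumptions made for this example.

```r
set.seed(1)

# Simulated data with an unknown mean (true value 1) and known sd of 1.
y <- rnorm(50, mean = 1, sd = 1)

# Log posterior up to a constant: log likelihood plus log prior.
log_post <- function(mu) {
  sum(dnorm(y, mean = mu, sd = 1, log = TRUE)) +
    dnorm(mu, mean = 0, sd = 10, log = TRUE)
}

n_iter <- 20000
chain <- numeric(n_iter)
chain[1] <- 0                                   # arbitrary starting value

for (t in 2:n_iter) {
  proposal  <- chain[t - 1] + rnorm(1, sd = 0.3)  # random-walk proposal
  log_ratio <- log_post(proposal) - log_post(chain[t - 1])
  # Accept with probability min(1, posterior ratio); otherwise stay put.
  if (log(runif(1)) < log_ratio) {
    chain[t] <- proposal
  } else {
    chain[t] <- chain[t - 1]
  }
}

draws <- chain[-(1:2000)]                       # discard burn-in

# With a normal prior and known variance the posterior mean has a closed form,
# which gives a check on the sampler.
exact_post_mean <- sum(y) / (length(y) + 1 / 10^2)
print(c(mcmc = mean(draws), exact = exact_post_mean))
```

The closed-form comparison is only possible because this example is deliberately simple; MCMC is typically used when no such formula exists.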

For Additional Reference

• For many more examples of simulations in R and
Stata go to www.econometricsbysimulation.com
