Optimal Sample Design for Tax Audits Under a Constraint
Jamie Schreader and Michelle Norris, PhD.
California State University, Sacramento
Department of Mathematics and Statistics
Introduction
Stratified random sampling is a method of sampling that involves dividing a
population into smaller, homogeneous groups, known as strata. The California Board of
Equalization uses stratified sampling to limit the number of invoices its employees
audit each year. The total tax error for a company is determined by finding an
average dollar amount of error per stratum. That amount is used to extrapolate the
total error in an audit population. For a stratum to be included in the total error,
its sample of invoices must contain at least three invoices in error. This is referred
to as the three error rule. The goal of this project is to determine optimal sample
sizes under various methods of stratification while minimizing the Mean Square
Error (MSE) of the error estimate, which combines its variance and its squared bias,
under the three error rule. Monte Carlo simulations in R, a statistical computing
environment, are used to check theoretical calculations of MSE.
The simulated data set is modeled after an invoice population from the California
Board of Equalization. The invoice population has 13,300 invoices in total: three
thousand invoices ranging from $0 to $100, six thousand invoices ranging from $100
to $500, four thousand invoices ranging from $500 to $5000, and three hundred
invoices with a total in excess of $5000. The probability of error was determined to
be roughly 20% for small invoices (less than $500), roughly 15% for medium invoices
(between $500 and $5000), and roughly 2% for large invoices (in excess of $5000). A
sample data set was created for each range using a uniform random number generator
in R. This provided a test data set for checking that the theoretical calculations
matched the Monte Carlo simulation results.
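The construction of the test population can be sketched as follows. This is a minimal illustration in Python rather than the R code used for the project. The ranges, counts, and error probabilities come from the description above; the upper bound of $20,000 for the largest invoices, the seed, and all function names are assumptions made only for this sketch. An invoice in error is treated as wrong by its full amount.

```python
import random

# (low, high, count, error probability) for each invoice range in the text.
# The top range has no stated upper bound; 20000 is a placeholder assumption.
RANGES = [
    (0.0, 100.0, 3000, 0.20),
    (100.0, 500.0, 6000, 0.20),
    (500.0, 5000.0, 4000, 0.15),
    (5000.0, 20000.0, 300, 0.02),
]


def simulate_population(ranges=RANGES, seed=1):
    """Return a list of (invoice_amount, error_amount) pairs."""
    rng = random.Random(seed)
    population = []
    for low, high, count, p_error in ranges:
        for _ in range(count):
            amount = rng.uniform(low, high)
            # An invoice in error carries its full amount as the error.
            error = amount if rng.random() < p_error else 0.0
            population.append((amount, error))
    return population


population = simulate_population()
total_error = sum(error for _, error in population)
```

Fixing the seed keeps the population reproducible, so a theoretical calculation and a simulation can be compared on the same data.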
Method Development
N = population size
n = sample size
X = population of invoice amounts
Y = population of error amounts
J = number of errors in population
K = number of errors in sample
\(\bar{x}_k\) = estimator of the population mean, given k errors in the sample
\[
Y_i =
\begin{cases}
0, & \text{when the } i\text{-th invoice is not in error,} \\
X_i, & \text{when the } i\text{-th invoice is in error.}
\end{cases}
\]
\[
\tau_y = \sum_{i=1}^{N} Y_i
\]
\[
\bar{Y}_n = \frac{(n-k)\cdot 0 + \sum_{i=1}^{k} y_i}{n} = \frac{k}{n}\,\bar{x}_k
\]
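As a hedged illustration of this estimator, the sketch below computes the expanded total N times the sample mean error and applies the three error rule by returning zero when fewer than three errors appear in the sample. The function name and the `min_errors` parameter are hypothetical, chosen for this example only.

```python
def expanded_error_estimate(sample_errors, population_size, min_errors=3):
    """Return N * Ybar_n, or 0.0 when the three error rule is not met."""
    n = len(sample_errors)
    nonzero = [y for y in sample_errors if y != 0]
    k = len(nonzero)
    if n == 0 or k < min_errors:
        # Too few errors observed: the sample contributes no error.
        return 0.0
    xbar_k = sum(nonzero) / k      # mean error among the k invoices in error
    ybar_n = (k / n) * xbar_k      # mean error over all n sampled invoices
    return population_size * ybar_n


# Ten sampled invoices, three in error, from a population of 1000 invoices.
estimate = expanded_error_estimate([100.0, 200.0, 300.0] + [0.0] * 7, 1000)
```

Because the estimate drops to zero whenever k is less than three, the estimator is biased downward, which is why the MSE rather than the variance alone is the quantity of interest.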
\[
\mathrm{MSE} = E[(N \cdot \bar{Y}_n - \tau_y)^2]
= E[(N \cdot \bar{Y}_n)^2] - 2 \cdot E[N \cdot \bar{Y}_n \cdot \tau_y] + E[\tau_y^2]
\]
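A Monte Carlo check of the MSE can be sketched along the following lines. This is an assumption-laden illustration, not the project's R code: it draws simple random samples without replacement from one fixed population of error amounts, so it estimates the MSE conditional on that population, whereas the theoretical expressions also average over the number of errors in the population. The function names, the number of replications, and the seed are hypothetical.

```python
import random


def three_error_estimate(sample_errors, population_size, min_errors=3):
    """Expanded total error, set to 0.0 when fewer than min_errors errors appear."""
    nonzero = [y for y in sample_errors if y != 0]
    if len(nonzero) < min_errors:
        return 0.0
    return population_size * sum(nonzero) / len(sample_errors)


def monte_carlo_mse(errors, sample_size, replications=2000, seed=1):
    """Average squared distance between the estimate and the true total error."""
    rng = random.Random(seed)
    population_size = len(errors)
    tau_y = sum(errors)
    total_squared_error = 0.0
    for _ in range(replications):
        sample = rng.sample(errors, sample_size)  # simple random sample, no replacement
        estimate = three_error_estimate(sample, population_size)
        total_squared_error += (estimate - tau_y) ** 2
    return total_squared_error / replications
```

To approximate the unconditional MSE, one option would be to regenerate the population inside the loop as well; that choice is left open here.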
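\[
E[(N \cdot \bar{Y}_n)^2] = N^2 \sum_{j=0}^{N} \sum_{k=3}^{\min(n,j)}
\left(\frac{k}{n}\right)^2 \left[\sigma_x^2\, C(j,k) + \mu_x^2\right]
\frac{\binom{j}{k}\binom{N-j}{n-k}}{\binom{N}{n}}
\binom{N}{j} \pi^j (1-\pi)^{(N-j)}
\]
where
\[
C(j,k) = \frac{(j-k)(j-1)}{j k^2} + \frac{N-j}{N j}
\]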
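\[
E[N \cdot \bar{Y}_n \cdot \tau_y] = \frac{N}{n} \sum_{j=3}^{N}
j \left[\frac{\sigma_x^2}{j} \cdot \frac{N-j}{N} + \mu_x^2\right]
\sum_{k=3}^{\min(n,j)} k\,
\frac{\binom{j}{k}\binom{N-j}{n-k}}{\binom{N}{n}}
\binom{N}{j} \pi^j (1-\pi)^{(N-j)}
\]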
\[
E[\tau_y^2] = \sum_{k=0}^{N} k^2
\left(\frac{\sigma_x^2}{k} \cdot \frac{N-k}{N} + \mu_x^2\right)
\binom{N}{k} \pi^k (1-\pi)^{(N-k)}
\]
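The last of these expectations is simple enough to evaluate directly, which gives a quick numerical check on the theory. The sketch below assumes that π is the probability that an invoice is in error and that µ_x and σ_x² are the mean and variance of the invoice amounts; these readings, and the function names, are assumptions of this example. The binomial weights are computed on the log scale so that population sizes in the thousands do not overflow, and the k = 0 term is skipped because it contributes zero.

```python
import math


def binomial_pmf(k, n, p):
    """Binomial probability computed on the log scale; requires 0 < p < 1."""
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1.0 - p)
    )
    return math.exp(log_pmf)


def expected_tau_squared(N, pi, mu_x, sigma2_x):
    """Evaluate E[tau_y^2] as a sum over k, the number of errors in the population."""
    total = 0.0
    for k in range(1, N + 1):  # the k = 0 term is zero and is skipped
        second_moment = k**2 * ((sigma2_x / k) * (N - k) / N + mu_x**2)
        total += second_moment * binomial_pmf(k, N, pi)
    return total


value = expected_tau_squared(N=50, pi=0.2, mu_x=10.0, sigma2_x=4.0)
```

When σ_x² is zero, the sum reduces to µ_x² times the second moment of a binomial count, which is a convenient sanity check on the implementation.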
Description of Comparisons
The California Board of Equalization generally uses a take-all stratum for the
largest invoices, which means every invoice in that stratum is audited. The
following graphs compare sample size, coefficient of variation (CV), and MSE when
the stratification boundaries are determined in R using the Lavallee-Hidiroglou
method and the cumulative square root of the frequency (cumrootf) method.
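For readers unfamiliar with the cumulative square root of the frequency rule, the following is a minimal Python sketch of the textbook procedure: bin the stratification variable into equal-width classes, accumulate the square roots of the class counts, and cut where the running total is closest to equal shares. It is not the R routine used for the graphs; the function name, the default of 50 classes, and the closest-class tie-breaking are all assumptions of this example.

```python
import math


def cum_sqrt_f_boundaries(values, n_strata, n_classes=50):
    """Return n_strata - 1 boundaries from the cumulative sqrt(frequency) rule."""
    lo, hi = min(values), max(values)
    if hi <= lo:
        raise ValueError("values must not all be equal")
    width = (hi - lo) / n_classes

    # Frequency of each equal-width class; the maximum falls in the last class.
    counts = [0] * n_classes
    for v in values:
        index = min(int((v - lo) / width), n_classes - 1)
        counts[index] += 1

    # Running total of sqrt(frequency) at the upper edge of each class.
    cumulative = []
    running = 0.0
    for count in counts:
        running += math.sqrt(count)
        cumulative.append(running)

    boundaries = []
    for h in range(1, n_strata):
        target = h * cumulative[-1] / n_strata
        # Cut at the class edge whose running total is closest to the target.
        best = min(range(n_classes), key=lambda i: abs(cumulative[i] - target))
        boundaries.append(lo + (best + 1) * width)
    return boundaries


boundaries = cum_sqrt_f_boundaries(list(range(1000)), n_strata=2, n_classes=10)
```

With heavily skewed data and few classes, two targets can land on the same class edge, so a finer class grid may be needed in practice.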
Comparison of Stratification Methods
[Figure 1 plot: "Sample Size by Number of Strata". Panels: "With Take-All Stratum" and "Without Take-All Stratum". x-axis: Number of Strata (2, 3, 4); y-axis: Sample Size. Legend: Lavallee-Hidiroglou method, Cumrootf method.]
Figure 1: Sample sizes found using a default coefficient of variation of 0.15
[Figure 2 plot: "CV by Sample Size". Panels: "With Take-All Stratum" and "Without Take-All Stratum". x-axis: Sample Size; y-axis: Coefficient of Variation (CV). Legend: Lavallee-Hidiroglou method, Cumrootf method.]
Figure 2: CV is calculated for sample sizes ranging from 500-1200 invoices
[Figure 3 plot: "MSE by Number of Strata". Panels: "With Take-All Stratum" and "Without Take-All Stratum". x-axis: Number of Strata (2, 3, 4); y-axis: Mean Square Error (MSE). Legend: Lavallee-Hidiroglou method, Cumrootf method.]
Figure 3: Mean Square Error calculated using Monte Carlo simulations for number of strata
Conclusions
• For a fixed coefficient of variation, the cumulative square root of the frequency
method achieves smaller sample sizes than the Lavallee-Hidiroglou method when a
take-all stratum is used.
• For a fixed sample size, the Lavallee-Hidiroglou method achieves a smaller
coefficient of variation than the cumulative square root of the frequency method
when a take-all stratum is used.
• With a take-all stratum, the lowest MSE is achieved with two strata under the
Lavallee-Hidiroglou method; however, the highest MSE is also achieved under the
Lavallee-Hidiroglou method, with three strata.
• For a fixed coefficient of variation, the Lavallee-Hidiroglou method achieves
smaller sample sizes than the cumulative square root of the frequency method when
no take-all stratum is used.
• For a fixed sample size, the Lavallee-Hidiroglou method achieves a smaller
coefficient of variation than the cumulative square root of the frequency method
when no take-all stratum is used.
• Without a take-all stratum, the lowest MSE is achieved with two strata under the
cumulative square root of the frequency method; however, the highest MSE is also
achieved under the cumulative square root of the frequency method, with four
strata.
Future Directions
• A study of why the highest and lowest MSE are achieved by the same method of
stratification
• Continued work on the theoretical MSE calculation for stratified samples (the
current formula gives the MSE of simple random samples)
• Writing a program in R that places a stratum boundary after each invoice,
calculates the MSE for each boundary, and reports the optimal boundary locations,
that is, those with the smallest MSE
• Using the R package Shiny to create a graphical user interface that allows
auditors to upload information about their invoice population in order to find
optimal strata boundaries
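The exhaustive boundary search proposed above could take roughly the following shape, shown here in Python for two strata only. The planned program would score each candidate boundary by its MSE; this sketch instead uses a cheap stand-in criterion, the sum of N_h times S_h over strata, which is the leading term of the variance under Neyman allocation. The stand-in criterion, the function names, and the `min_size` parameter are all assumptions of this example, and any other scoring function can be passed in.

```python
import statistics


def neyman_score(lower, upper):
    """Stand-in criterion: sum over strata of (stratum size) * (stratum std dev)."""
    return sum(len(stratum) * statistics.pstdev(stratum) for stratum in (lower, upper))


def best_two_strata_boundary(amounts, score=neyman_score, min_size=2):
    """Try a boundary after every invoice; return (boundary, score) with the lowest score."""
    ordered = sorted(amounts)
    best = None
    for split in range(min_size, len(ordered) - min_size + 1):
        lower, upper = ordered[:split], ordered[split:]
        candidate_score = score(lower, upper)
        if best is None or candidate_score < best[1]:
            # The boundary is the largest invoice amount in the lower stratum.
            best = (ordered[split - 1], candidate_score)
    if best is None:
        raise ValueError("not enough invoices for two strata")
    return best


boundary, boundary_score = best_two_strata_boundary([1, 2, 3, 101, 102, 103])
```

Each candidate is scored from scratch, so the cost grows quadratically with the number of invoices; a Monte Carlo MSE score would be more expensive still, which may call for a coarser grid of candidate boundaries.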
Acknowledgments
• Funding provided through a SURE Award from NSM
• Inspiration provided by the California Board of Equalization
• Simulation tools made possible by Ross Ihaka and Robert Gentleman, the
gentlemen who wrote the foundation of the language for the open source
statistical package, R