Optimal Sample Design for Tax Audits Under a Constraint
Jamie Schreader and Michelle Norris, PhD.
California State University, Sacramento
Department of Mathematics and Statistics
Introduction
Stratified random sampling is a method of sampling that involves dividing a
population into smaller, homogeneous groups, known as strata. The California Board of
Equalization uses stratified sampling to limit the number of invoices its employees
audit each year. The total tax error for a company is determined by finding
an average dollar amount of error per stratum. That amount is used to extrapolate
the total error in an audit population. For a stratum to be included in the total error,
its sample of invoices must contain at least three invoices in error. This is referred
to as the three error rule. The goal of this project is to determine optimal sample sizes
through various methods of stratification while minimizing the Mean Square Error
(MSE) of the error estimate, which combines its variance and bias, under the three
error rule. Monte Carlo simulations in R, a statistical computing program, are used to
check theoretical calculations of MSE.
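The three error rule can be illustrated with a minimal sketch. It is written in Python rather than the R used for the project, and the function names and toy numbers are hypothetical. It assumes that a stratum failing the rule contributes zero to the total, which is one reading of "included in the total error".

```python
from statistics import mean

def stratum_total_estimate(stratum_size, sampled_errors, min_errors=3):
    """Extrapolate the total error for one stratum under the three error rule.

    sampled_errors holds one error amount per sampled invoice (0 if the
    invoice is correct). Assumption: a stratum with fewer than `min_errors`
    sampled invoices in error contributes nothing to the total.
    """
    errors_found = sum(1 for e in sampled_errors if e != 0)
    if errors_found < min_errors:
        return 0.0
    # Average error per sampled invoice, scaled up to the whole stratum.
    return stratum_size * mean(sampled_errors)

def total_error_estimate(strata):
    """Sum the stratum estimates; `strata` is a list of (size, sampled_errors)."""
    return sum(stratum_total_estimate(size, errs) for size, errs in strata)

strata = [
    (1000, [0, 0, 12.0, 0, 8.0, 0, 0, 0, 0, 0]),     # two errors: excluded
    (500, [0, 40.0, 0, 25.0, 0, 0, 55.0, 0, 0, 0]),  # three errors: included
]
print(total_error_estimate(strata))  # prints 6000.0
```

The first stratum is dropped even though errors were found in it, which is the source of the bias that the MSE criterion has to account for.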
The simulated data set is modeled after an invoice population from the California
Board of Equalization. The invoice population has 13,300 invoices in total:
three thousand invoices ranging from $0 to $100, six thousand invoices ranging from
$100 to $500, four thousand invoices ranging from $500 to $5000, and three hundred
invoices with a total in excess of $5000. It was determined that the probability of
error is roughly 20% for small invoices of less than $500, roughly 15% for medium
invoices between $500 and $5000, and roughly 2% for large invoices in excess of $5000.
A sample data set was created for each range using a random number generator
with a uniform distribution in R. This provided a test data set for checking that
theoretical calculations matched Monte Carlo simulation results.
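A population of this shape could be generated as in the sketch below. It is a Python illustration, not the R code used for the project. The $20,000 upper limit on the largest range and the seed are assumptions, since the text only says "in excess of $5000". Following the definition of Y in the Method Development section, an invoice in error is given an error amount equal to its invoice amount.

```python
import random

# (count, low, high, error probability) for each invoice range in the text.
# Assumption: the largest range is capped at $20,000 for illustration only.
RANGES = [
    (3000, 0.0, 100.0, 0.20),
    (6000, 100.0, 500.0, 0.20),
    (4000, 500.0, 5000.0, 0.15),
    (300, 5000.0, 20000.0, 0.02),
]

def simulate_population(ranges=RANGES, seed=1):
    """Return (amounts, errors): invoice amounts X and error amounts Y."""
    rng = random.Random(seed)
    amounts, errors = [], []
    for count, low, high, p_error in ranges:
        for _ in range(count):
            x = rng.uniform(low, high)          # uniform invoice amount
            in_error = rng.random() < p_error   # Bernoulli error indicator
            amounts.append(x)
            errors.append(x if in_error else 0.0)
    return amounts, errors

amounts, errors = simulate_population()
print(len(amounts), sum(1 for e in errors if e > 0))
```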
Method Development
N = population size
n = sample size
X = population of invoice amounts
Y = population of error amounts
J = number of errors in the population
K = number of errors in the sample
\(\bar{x}_k\) = estimator of the population mean, given k errors in the sample
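\[
Y_i =
\begin{cases}
0, & \text{when the } i\text{-th invoice is not in error,} \\
X_i, & \text{when the } i\text{-th invoice is in error.}
\end{cases}
\]

\[
\tau_y = \sum_{i=1}^{N} Y_i
\]

\[
\bar{Y}_n = \frac{(n-k)\cdot 0 + \sum_{i=1}^{k} y_i}{n} = \frac{k}{n}\,\bar{x}_k
\]

\[
\mathrm{MSE} = E\left[(N\bar{Y}_n - \tau_y)^2\right]
= E\left[(N\bar{Y}_n)^2\right] - 2\,E\left[N\bar{Y}_n\,\tau_y\right] + E\left[\tau_y^2\right]
\]

\[
E\left[(N\bar{Y}_n)^2\right]
= N^2 \sum_{j=0}^{N} \sum_{k=3}^{\min(n,j)}
\left(\frac{k}{n}\right)^{2}
\left[\sigma_x^2\,C(j,k) + \mu_x^2\right]
\frac{\binom{j}{k}\binom{N-j}{n-k}}{\binom{N}{n}}
\binom{N}{j}\,\pi^{j}(1-\pi)^{N-j}
\]

where

\[
C(j,k) = \frac{(j-k)(j-1)}{jk^{2}} + \frac{N-j}{Nj}
\]

\[
E\left[N\bar{Y}_n\,\tau_y\right]
= \frac{N}{n} \sum_{j=3}^{N}
j\left[\frac{\sigma_x^2}{j}\cdot\frac{N-j}{N} + \mu_x^2\right]
\sum_{k=3}^{\min(n,j)} k\,
\frac{\binom{j}{k}\binom{N-j}{n-k}}{\binom{N}{n}}
\binom{N}{j}\,\pi^{j}(1-\pi)^{N-j}
\]

\[
E\left[\tau_y^2\right]
= \sum_{k=0}^{N} k^{2}
\left(\frac{\sigma_x^2}{k}\cdot\frac{N-k}{N} + \mu_x^2\right)
\binom{N}{k}\,\pi^{k}(1-\pi)^{N-k}
\]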
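A Monte Carlo check of the MSE for a simple random sample might look like the following sketch. It is written in Python for illustration, and the function name and toy population are hypothetical. It conditions on one fixed population of error amounts, whereas the expectations above also average over the number of errors in the population, so it is a simplified stand-in rather than the project's simulation. As in the sums above, which start at k = 3, the estimate is taken to be zero when fewer than three sampled invoices are in error.

```python
import random

def monte_carlo_mse(errors, n, reps=2000, min_errors=3, seed=1):
    """Estimate the MSE of N * Ybar_n for a fixed population of error amounts."""
    rng = random.Random(seed)
    N = len(errors)
    tau_y = sum(errors)  # true total error in the population
    total_sq = 0.0
    for _ in range(reps):
        sample = rng.sample(errors, n)        # simple random sample, no replacement
        k = sum(1 for y in sample if y != 0)  # number of sampled invoices in error
        estimate = N * sum(sample) / n if k >= min_errors else 0.0
        total_sq += (estimate - tau_y) ** 2
    return total_sq / reps

# Toy population: 20 of 100 invoices in error, each by $10.
toy_errors = [0.0] * 80 + [10.0] * 20
print(monte_carlo_mse(toy_errors, n=25))
```

Comparing the returned value with the theoretical expression for several sample sizes is the kind of agreement check described in the Introduction.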
Description of Comparisons
The California Board of Equalization generally uses a take-all stratum for the
largest invoices, which means that auditors examine every invoice in that stratum.
The following graphs compare sample sizes, coefficient of variation, and MSE when
the Lavallee-Hidiroglou method and the cumulative square root of the frequency
method are used in R to determine stratification boundaries.
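For readers unfamiliar with the cumulative square root of the frequency rule, a minimal Python sketch follows. It is not the R routine used for the graphs. The number of histogram classes and the choice to cut at the upper edge of the first class that reaches each target are assumptions of this sketch.

```python
import math

def cum_sqrt_f_boundaries(values, n_strata, n_classes=50):
    """Stratum boundaries from the cumulative sqrt(frequency) rule.

    Assumes max(values) > min(values) so that the class width is positive.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_classes
    # Histogram of the stratification variable over equal-width classes.
    freq = [0] * n_classes
    for v in values:
        idx = min(int((v - lo) / width), n_classes - 1)
        freq[idx] += 1
    # Running total of sqrt(frequency) across classes.
    cum, running = [], 0.0
    for f in freq:
        running += math.sqrt(f)
        cum.append(running)
    # Cut where the running total crosses equal shares of the final total.
    step = cum[-1] / n_strata
    boundaries = []
    for h in range(1, n_strata):
        idx = next(i for i, c in enumerate(cum) if c >= h * step)
        boundaries.append(lo + (idx + 1) * width)
    return boundaries

invoice_amounts = [float(v) for v in range(1, 1001)]
print(cum_sqrt_f_boundaries(invoice_amounts, n_strata=3))
```

The Lavallee-Hidiroglou method instead searches iteratively for boundaries that minimize the sample size needed for a target CV, and is not sketched here.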
Comparison of Stratification Methods
[Plot: Sample Size by Number of Strata. Panels: With Take-All Stratum; Without Take-All Stratum. x-axis: Number of Strata (2, 3, 4); y-axis: Sample Size. Series: Lavallee-Hidiroglou method; Cumrootf method.]
Figure 1: Sample sizes found using a default coefficient of variation of 0.15
[Plot: CV by Sample Size. Panels: With Take-All Stratum; Without Take-All Stratum. x-axis: Sample Size; y-axis: Coefficient of Variation (CV). Series: Lavallee-Hidiroglou method; Cumrootf method.]
Figure 2: CV is calculated for sample sizes ranging from 500-1200 invoices
[Plot: MSE by Number of Strata. Panels: With Take-All Stratum; Without Take-All Stratum. x-axis: Number of Strata (2, 3, 4); y-axis: Mean Square Error (MSE). Series: Lavallee-Hidiroglou method; Cumrootf method.]
Figure 3: Mean Square Error calculated using Monte Carlo simulations for number of strata
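The coefficient of variation compared in Figure 2 can be computed for a given stratification as sketched below in Python. This uses the textbook variance of a stratified estimator of a total under simple random sampling without replacement within strata, and it ignores the three error rule, so it is an assumption about the kind of quantity plotted rather than the exact calculation behind the figure.

```python
import math
from statistics import variance

def stratified_cv(strata, sample_sizes):
    """CV of the stratified estimator of the population total.

    strata: list of lists, each holding every value in one stratum
            (each sampled stratum needs at least two values).
    sample_sizes: number of units sampled from each stratum.
    """
    total = sum(sum(s) for s in strata)
    var = 0.0
    for values, n_h in zip(strata, sample_sizes):
        N_h = len(values)
        if n_h >= N_h:
            continue  # take-all stratum: no sampling variance
        S2_h = variance(values)  # within-stratum variance (divisor N_h - 1)
        var += N_h ** 2 * (1 - n_h / N_h) * S2_h / n_h
    return math.sqrt(var) / total

strata = [[1, 2, 3, 4], [10, 20, 30, 40]]
print(stratified_cv(strata, [2, 4]))  # second stratum is take-all
```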
Conclusions
• With a take-all stratum and a fixed coefficient of variation, the cumulative
square root of the frequency method achieves smaller sample sizes than the
Lavallee-Hidiroglou method.
• With a take-all stratum and a fixed sample size, the Lavallee-Hidiroglou method
achieves a smaller coefficient of variation than the cumulative square root of
the frequency method.
• With a take-all stratum, the lowest MSE is achieved with two strata using the
Lavallee-Hidiroglou method; however, the highest MSE is achieved with three
strata using the same method.
• Without a take-all stratum and for a fixed coefficient of variation, the
Lavallee-Hidiroglou method achieves smaller sample sizes than the cumulative
square root of the frequency method.
• Without a take-all stratum and for a fixed sample size, the Lavallee-Hidiroglou
method achieves a smaller coefficient of variation than the cumulative square
root of the frequency method.
• Without a take-all stratum, the lowest MSE is achieved with two strata using the
cumulative square root of the frequency method; however, the highest MSE is
achieved with four strata using the same method.
Future Directions
• A study that considers why the highest and lowest MSE are achieved by the same
method of stratification
• Continued work on the theoretical MSE calculation for stratified samples (current
formula is designed to find the MSE of simple random samples)
• Writing a program in R that will place a stratum boundary after each invoice,
calculate the MSE for each boundary, and report the optimal boundary locations
by finding the smallest MSE achieved
• Using a package in R called Shiny, creating a graphical user interface that will
allow auditors to upload information about their invoice population in order to
find optimal strata boundaries
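The exhaustive boundary search proposed above could start from something like the following Python sketch for two strata. It is only an illustration of the idea: it scores each candidate boundary by the variance of the stratified estimator under proportional allocation, as a stand-in for the MSE under the three error rule, and both that criterion and the function name are assumptions.

```python
from statistics import variance

def best_two_strata_boundary(values, n, min_stratum=2):
    """Try a boundary after each sorted value; return (boundary, variance).

    Assumption: variance under proportional allocation is the score.
    A real version would score each boundary by simulated or theoretical MSE.
    """
    xs = sorted(values)
    N = len(xs)
    best = None
    for cut in range(min_stratum, N - min_stratum + 1):
        var = 0.0
        for stratum in (xs[:cut], xs[cut:]):
            N_h = len(stratum)
            # Proportional allocation, at least 2 and at most N_h units.
            n_h = max(2, min(N_h, round(n * N_h / N)))
            var += N_h ** 2 * (1 - n_h / N_h) * variance(stratum) / n_h
        if best is None or var < best[1]:
            best = (xs[cut - 1], var)  # boundary sits after this value
    return best

toy = [1, 2, 3, 4, 5, 101, 102, 103, 104, 105]
print(best_two_strata_boundary(toy, n=4))
```

Recomputing the variance for every cut makes this quadratic in the population size, so a population of 13,300 invoices would call for running sums or a coarser grid of candidate boundaries.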
Acknowledgments
• Funding provided through a SURE Award from NSM
• Inspiration provided by the California Board of Equalization
• Simulation tools made possible by Ross Ihaka and Robert Gentleman, the
gentlemen who wrote the foundation of the language for the open source
statistical package, R
