totalfor    Total independent expenditures used for the candidate, in thousands of dollars
totaldisb   Candidate’s total disbursements, in thousands of dollars
advert      Amount of the candidate’s total disbursements used for advertisements, in thousands of dollars
res         Residuals from the regression of the candidate’s total disbursements (in thousands of dollars) on the candidate’s vote margins
Additionally, it was found that the majority of observations fall below $1,000 (thousand) for totaldisb, below $500 (thousand) for advert, and below -15 for res (approximately the third quartile of each response variable). Because of this, the data were split into seven data frames: one for each subset defined by the cutoff points above, plus one further split of the res subgroup below -15 at -25. For instance, one data frame contains only those observations whose totaldisb falls below $1,000 (thousand), and is the frame used when regressing on totaldisb. These data frames are named and summarized as follows:
Table 2: New data frames and their descriptions

Data Frame Name                Description
totalDisbLessThan1000          Observations whose total disbursements are less than $1,000 (thousand)
totalDisbGreaterThan1000       Observations whose total disbursements are greater than or equal to $1,000 (thousand)
advertLessThan500              Observations whose total advertisement disbursements are less than $500 (thousand)
advertGreaterThan500           Observations whose total advertisement disbursements are greater than or equal to $500 (thousand)
resLessThanNeg25               Observations whose residuals are less than -25
resGreaterThanNeg25LessThan15  Observations whose residuals are greater than or equal to -25 and less than -15
resGreaterThanNeg15            Observations whose residuals are greater than or equal to -15
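The subsetting described above can be sketched in a few lines. The field names follow the report’s variables, but the records themselves are invented purely for illustration; the report’s actual frames were built from the full data set.

```python
# Toy records; totaldisb and advert are in thousands of dollars, res is a
# regression residual. Values are made up to illustrate the cutoffs.
rows = [
    {"totaldisb": 250.0,  "advert": 90.0,  "res": -40.0},
    {"totaldisb": 980.0,  "advert": 450.0, "res": -20.0},
    {"totaldisb": 1500.0, "advert": 600.0, "res": 10.0},
]

# One frame per cutoff from Table 2
totalDisbLessThan1000 = [r for r in rows if r["totaldisb"] < 1000]
totalDisbGreaterThan1000 = [r for r in rows if r["totaldisb"] >= 1000]
advertLessThan500 = [r for r in rows if r["advert"] < 500]
advertGreaterThan500 = [r for r in rows if r["advert"] >= 500]
resLessThanNeg25 = [r for r in rows if r["res"] < -25]
resGreaterThanNeg25LessThan15 = [r for r in rows if -25 <= r["res"] < -15]
resGreaterThanNeg15 = [r for r in rows if r["res"] >= -15]
```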
will need to be considered as an interactive term while searching for a significant model in
regression.
Table 3: p-values from Fisher's Exact Test between categorical pairs; p-values at or below 0.050 (highlighted in the original) indicate a significant association
Columns: (1) totalDisbLessThan1000, (2) totalDisbGreaterThan1000, (3) advertLessThan500, (4) advertGreaterThan500, (5) resLessThanNeg25, (6) resGreaterThanNeg25LessThan15, (7) resGreaterThanNeg15

Pair                       (1)    (2)    (3)    (4)    (5)    (6)    (7)
open, experienced          0.107  0.188  0.027  0.125  0.505  0.505  0.646
open, competitive          0.042  0.000  0.234  0.002  0.000  1.000  0.000
open, dark                 0.416  0.046  0.356  0.226  0.205  0.235  0.053
experienced, dark          0.002  1.000  0.000  0.881  0.656  0.061  1.000
competitive, republican    1.000  0.481  0.026  0.369  0.220  1.000  0.557
competitive, year2012      1.000  0.022  0.099  0.034  0.487  1.000  0.037
competitive, year2014      1.000  0.022  0.099  0.034  0.487  1.000  0.037
competitive, dark          1.000  0.000  0.000  0.000  1.000  1.000  0.000
year2012, year2014         0.000  0.000  0.000  0.000  0.000  0.000  0.000
year2012, dark             0.000  0.011  0.001  0.036  0.178  0.000  0.054
year2014, dark             0.000  0.011  0.001  0.036  0.178  0.000  0.054
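Each p-value in the table above comes from Fisher's Exact Test on a 2x2 contingency table of two categorical variables. A minimal, standard-library-only sketch of the two-sided test using the hypergeometric distribution (the report's values were produced with statistical software, not this code):

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    denom = comb(n, col1)

    def pmf(x):
        # Hypergeometric probability of x successes in the top-left cell
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    p_obs = pmf(a)
    lo = max(0, col1 - (n - row1))
    hi = min(row1, col1)
    # Sum probabilities of all tables at least as extreme as the observed one
    return sum(pmf(x) for x in range(lo, hi + 1) if pmf(x) <= p_obs * (1 + 1e-9))
```

For example, the balanced table [[3, 1], [1, 3]] gives a two-sided p-value of 34/70, about 0.486.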
The level of relationship between a numerical variable and a categorical variable can be measured with a one-way ANOVA test, which tests whether the differences in means across the categorical levels are statistically significant. If so, the categorical variable can be interpreted as associated with the numerical variable. The significant p-values found through one-way ANOVA tests are collected in Table 4 (only rows with at least one significant p-value are shown; the remaining p-values are given in the appendix, section A).
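A one-way ANOVA boils down to comparing between-group variability to within-group variability. A standard-library-only sketch of the F statistic (the p-value then comes from the F distribution, which the report's software handles):

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of numeric groups."""
    n = sum(len(g) for g in groups)   # total observations
    k = len(groups)                   # number of categorical levels
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    # Mean square between (k-1 df) over mean square within (n-k df)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

For two toy groups [1, 2, 3] and [4, 5, 6] this gives F = 13.5 on (1, 4) degrees of freedom.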
Table 4: One-way ANOVA test p-values between numerical and categorical predictor variables
Columns: (1) totalDisbLessThan1000, (2) totalDisbGreaterThan1000, (3) advertLessThan500, (4) advertGreaterThan500, (5) resLessThanNeg25, (6) resGreaterThanNeg25LessThan15, (7) resGreaterThanNeg15

Pair                 (1)    (2)    (3)    (4)    (5)    (6)    (7)
dpres~open 0.877 0.117 0.634 0.195 0.739 0.581 0.050
dpres~republican 0.000 0.017 0.000 0.529 0.000 0.163 0.005
oppdisb~open 0.031 0.000 0.036 0.003 0.286 0.881 0.200
oppdisb~competitive 0.401 0.020 0.170 0.046 0.655 0.942 0.053
oppdisb~year2012 0.066 0.403 0.066 0.476 0.515 0.012 0.407
oppdisb~dark 0.121 0.925 0.008 0.773 0.975 0.156 0.789
totalagn~open 0.493 0.003 0.351 0.014 0.548 0.162 0.003
totalagn~experienced 0.017 0.203 0.004 0.138 0.145 0.880 0.219
totalagn~competitive 0.949 0.000 0.000 0.000 0.911 0.016 0.000
totalagn~dark 0.000 0.000 0.000 0.000 0.120 0.000 0.000
totalfor~open 0.936 0.259 0.866 0.362 0.658 0.935 0.295
totalfor~experienced 0.011 0.953 0.016 0.643 0.076 0.256 0.982
totalfor~competitive 0.776 0.001 0.000 0.007 0.946 0.003 0.005
totalfor~republican 0.881 0.002 0.291 0.006 0.632 0.125 0.003
totalfor~year2012 0.427 0.077 0.295 0.052 0.275 0.112 0.098
totalfor~dark 0.006 0.000 0.000 0.000 0.189 0.000 0.000
Within these tables, the pairwise comparisons include predictor variables tested against one another as well as against the response variables. Significance between a response variable and a predictor variable indicates that the predictor may matter most in a predictive model later, whereas significance between two predictor variables shows the potential need for interactive terms. Due to the number of comparisons being made, however, the probability that at least one significant test is due to random chance alone is relatively high. Even so, if an interactive term included in the model search was built from two variables that are not truly dependent on each other, that term will show as insignificant while building a model, and thus be dropped.
IV. Methods
The interest of this study is to use several control variables to attempt to predict different response variables, specifically totaldisb, advert, and res. One method of doing this is regression, simple or multiple. Since several predictor variables show relationships with one another, it will be necessary in this study to utilize multiple linear regression with interactive terms, which takes the following form (1):
(1) $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n + \beta_{n+1}(X_i{:}X_j) + \cdots$
where Y represents the response variable; $\beta_0$ is the intercept of the model, or the value of the response when all else equals zero; $\beta_1$ through $\beta_n$ are the partial effects of the variables $X_1$ through $X_n$ while holding all else constant (an $X_i$ may or may not also appear in an interactive term); and $\beta_{n+1}$ is the combined effect of $X_i$ and $X_j$ (with i and j between 1 and n). If an interactive term is made up of categorical variables, then the term only exists when both are set equal to one (that is, when each categorical variable exhibits the attribute given).
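Equation (1) can be illustrated concretely. The sketch below fits a model with one interactive term by solving the normal equations directly on synthetic, noise-free data; the coefficient values and inputs are invented, and the report's models were of course fit with statistical software rather than by hand.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def ols_with_interaction(x1, x2, y):
    """Least squares for Y = b0 + b1*X1 + b2*X2 + b3*(X1:X2)."""
    X = [[1.0, a, b, a * b] for a, b in zip(x1, x2)]  # design matrix
    p, m = len(X[0]), len(X)
    XtX = [[sum(X[r][i] * X[r][j] for r in range(m)) for j in range(p)] for i in range(p)]
    Xty = [sum(X[r][i] * y[r] for r in range(m)) for i in range(p)]
    return solve(XtX, Xty)  # normal equations (X'X)b = X'y

# Exact synthetic data from Y = 2 + 3*X1 + 0.5*X2 + 1.5*(X1:X2)
x1 = [0, 1, 2, 3, 4, 5]
x2 = [1, 0, 1, 0, 1, 0]   # an indicator variable, as in the report
y = [2 + 3 * a + 0.5 * b + 1.5 * a * b for a, b in zip(x1, x2)]
coefs = [round(c, 6) for c in ols_with_interaction(x1, x2, y)]
```

Because the data are noise-free, the fit recovers the generating coefficients exactly.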
When attempting to build a multiple regression model from a list of available predictor variables, it is common to use a “step forward,” “step backward,” or “both directions” method, which adds and discards variables based on their significance levels. The significance level is found through a t-test (2), calculated by dividing the variable's estimate by the standard error of that estimate; the t-value is then used to compute a p-value.
(2) $t = B_i / SE(B_i)$
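Equation (2) in code, with a two-sided p-value taken from the normal approximation (Python's standard library has no t-distribution CDF, so this slightly understates p in small samples); the estimate and standard error below are made up for illustration.

```python
from math import erfc, sqrt

def t_stat(estimate, std_error):
    """t value for a coefficient: the estimate divided by its standard error."""
    return estimate / std_error

def approx_two_sided_p(t):
    """Two-sided p-value via the normal approximation to the t distribution."""
    return erfc(abs(t) / sqrt(2))

t = t_stat(0.5, 0.2)        # illustrative estimate and SE -> t = 2.5
p = approx_two_sided_p(t)
```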
The build method utilized for this project was “step backward.” Specifically, this is done by fitting the full model for the response variable and then removing predictor variables one at a time, least significant first, until all remaining variables and the model itself are found to be significant.
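The control flow of the “step backward” procedure can be sketched as follows. The fit function here is a stub returning canned p-values keyed by predictor, purely to show the loop; a real implementation would refit the regression at every step.

```python
def backward_eliminate(predictors, fit, alpha=0.05):
    """Repeatedly drop the least significant predictor until all clear alpha."""
    current = list(predictors)
    while current:
        pvals = fit(current)                        # predictor -> p-value
        worst = max(current, key=lambda v: pvals[v])
        if pvals[worst] <= alpha:
            break                                   # everything significant; stop
        current.remove(worst)                       # drop it and "refit"
    return current

# Stub fit with invented p-values, for illustration only
canned = {"dpres": 0.001, "oppdisb": 0.020, "open": 0.300, "dark": 0.400}
kept = backward_eliminate(canned, lambda cur: {v: canned[v] for v in cur})
```

Here dark (0.400) is dropped first, then open (0.300), leaving dpres and oppdisb, both significant at the 0.05 level.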
However, all of these methods are only valid if the model assumptions hold: normality of errors, constant variance of errors, and independence of errors. If at least one assumption fails, it is necessary to consider a transformation such as taking the logarithm of the response variable or applying a Box-Cox transformation. In the worst case, even a transformation will not correct the model assumptions.
V. Analysis and Results
By using the “step backward” method for model building, a significant model was found for totaldisb within totalDisbLessThan1000, for advert within advertLessThan500, and for res within resLessThanNeg25, resGreaterThanNeg25LessThan15, and resGreaterThanNeg15. Models were not found, however, for totaldisb within totalDisbGreaterThan1000 or for advert within advertGreaterThan500, because either the variables provided were insignificant for the response or the model did not meet the assumptions. These are the significant models:
3.1 For data frame totalDisbLessThan1000:
$\log(\mathbf{totaldisb}) = \beta_0 + \beta_1\,dpres + \beta_2\,oppdisb + \beta_3\,totalagn + \beta_4\,totalfor + \beta_5\,experienced + \beta_6\,republican + \beta_7\,year2012 + \beta_8\,open + \beta_9\,dark + \beta_{10}\,(dpres{:}republican)$
3.2 For data frame advertLessThan500:
$\log(\mathbf{advert}) = \beta_0 + \beta_1\,dpres + \beta_2\,republican + \beta_3\,open + \beta_4\,dark + \beta_5\,(dpres{:}republican)$
3.3 For data frame resLessThanNeg25:
$\mathbf{res} = \beta_0 + \beta_1\,experienced + \beta_2\,oppdisb + \beta_3\,totalagn$
3.4 For data frame resGreaterThanNeg25LessThan15:
$\mathbf{res} = \beta_0 + \beta_1\,open + \beta_2\,competitive$
3.5 For data frame resGreaterThanNeg15:
$\mathbf{res} = \beta_0 + \beta_1\,dpres + \beta_2\,republican + \beta_3\,oppdisb + \beta_4\,(dpres{:}republican)$
The output from each significant model is given in the following table, where column (1) is model (3.1), column (2) is model (3.2), and so on. Additionally, the “constant” row is the intercept of each model.
The individual variables in each model can be interpreted as they normally would be for multiple regression, except for those variables included in an interactive term, namely dpres and republican. To interpret their effects on the model, it is necessary to hold all else constant and work through the different values each variable can take. Doing so for model (3.1), it can be simplified as:
(4) $\log(\mathbf{totaldisb}) = 2.543 + 0.029\,dpres + 4.395\,republican - 0.079\,(dpres{:}republican)$
Since dpres is continuous from zero to one hundred and republican is an indicator variable with value zero (not republican) or one (republican), the interpretation can be broken apart. For instance, if the candidate is not republican (4.1),
(4.1) $\log(\mathbf{totaldisb}) = 2.543 + 0.029\,dpres$
then for each percentage-point increase in dpres, the candidate spends an extra 0.029 units of log(totaldisb). In other words, the candidate's disbursements are multiplied by a factor of $e^{0.029} \approx 1.029$. If, on the other hand, the candidate is republican, things become more complex (4.2),
(4.2) $\log(\mathbf{totaldisb}) = 2.543 + 0.029\,dpres + 4.395(1) - 0.079\,dpres(1)$
which can be further simplified to (4.3):
(4.3) $\log(\mathbf{totaldisb}) = 2.543 + 4.395 - 0.050\,dpres$
Thus, if a candidate is republican, they show an initial 4.395-unit increase in log(totaldisb), but then a 0.050-unit decrease in log(totaldisb) for each percentage-point increase in dpres. In other words, a republican candidate's totaldisb starts higher by a factor of $e^{4.395} \approx 81.045$, but is then multiplied by a factor of $e^{-0.050\,dpres}$ as dpres increases from zero to one hundred.
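The back-transformed factors quoted above follow directly from exponentiating the log-scale coefficients; a quick arithmetic check:

```python
from math import exp

slope = 0.029          # dpres slope for non-republican candidates
rep_shift = 4.395      # republican indicator coefficient
interaction = -0.079   # dpres:republican coefficient

slope_rep = slope + interaction    # combined dpres slope when republican
factor_per_point = exp(slope)      # multiplicative effect per point of dpres
rep_factor = exp(rep_shift)        # initial republican multiplier
```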
As mentioned previously, variables not included in an interactive term may be interpreted normally. For example, the variable open shows a 0.667-unit increase in log(totaldisb) when the seat is open versus when it is not, meaning totaldisb is larger by a factor of $e^{0.667} \approx 1.948$ when the seat is open.
The model assumptions held for each significant model; the details for each can be found in the appendix, section A.