Election Visualizations
Brian Thorsen
12/6/2016
## Data has been slightly modified; geographic information is now added in correct form.
load("elecDataCorrected.rda")
Introduction
Credit: Ryan Saraie, Ryan Haeri
In this study, we draw on geographic data, results from four presidential elections, and the 2010 census to examine the results and distinctive aspects of the 2016 presidential election. With president-elect Donald Trump’s victory remaining at the center of the United States news cycle, it is important to understand the basis of his victory and analyze what factors led to his success. To do so, we compare a wide variety of variables that may indicate how well Mr. Trump did in particular counties of the United States. It is important to note that we are not trying to predict the 2016 election; rather, we are looking for predictors that influenced it, specifically ones that strategists and data scientists may not have accounted for. We base our results on our own data frame, compiled from a variety of data sets to provide a simple foundation for making maps and running statistical tests. In our analysis, we use visual and computational tools to understand what patterns led to a Trump presidency.
From the data frame, we create multiple plots and maps that highlight voter behavior by state. These direct visual representations show how vote counts differed by county; these differences are crucial for comparing previous elections to the 2016 election, and give us a sense of what past factors were at play in shaping the final results. While the maps give plenty of insight into how the 2016 election turned out, we also ran a few tests to make specific predictions about what the results could be. Our first test was to build a classification tree predicting how counties would vote in 2016; with this method, we cross-validated our data before fitting the final tree as a means of obtaining more reliable results. In addition, we built a K Nearest Neighbor classifier on the data frame to predict the change from the 2012 to the 2016 election results. Labeling each county by whether its majority party changed from 2012 to 2016, we compared a sample of training data with test data to make a prediction. While these methods of analysis are not fully accurate models of the 2016 election, they served as useful tools for understanding how certain numerical factors affected this country.
Data Overview and Description
Credit: Brian Thorsen
The data on which we conduct our analysis contains information from the 2016 presidential election, reported at the county level in terms of popular vote count, as well as information for the previous three elections. Additionally, geographic information on counties and information from the 2010 US Census are included, such as workforce demographics, family composition, and median income for all counties. These various forms of information constitute features from which we can build predictors for the 2016 election results.
Data was obtained from a variety of sources, primarily hosted online through Professor Deborah Nolan’s website. As these sources had to be merged together, some information loss arose due to counties missing from individual data sources, as well as heterogeneity in county names which prevented seamless merges.
In preparing our data for later analysis, several issues arose which had to be addressed. Among these was the proper handling of data represented in incorrect types, such as numeric data appearing as strings with added characters.
Some states were missing from the initial data frames provided: namely, Alaska was not present in the 2012 data, and
Virginia and Hawaii were not present in the 2004 data. For the latter two, we were able to obtain the results through parsing
data sources available elsewhere on the internet.
A particularly critical recurring issue was the handling of merges across the different data sources. To merge by county and
state pairs, it was necessary to obtain such information from every data source, which was often represented in different
ways. In one instance, the use of FIPS codes enabled seamless merging of the data, whereas other merges required string
manipulations to facilitate the matching of values. Additionally, an issue arose in the matter of missing values and values
which did not have a directly translatable equivalent in the data frame to be merged. To address these concerns, we
examined which values would be lost in the merge along the intersection of both data frames, and assessed the possibility of
manual reassignment of names. For counties in which manual reassignment was not a simple task, the next step was to determine, qualitatively, how important the counties were. For instance, it was a much easier decision to drop the least populous counties in New Mexico from the results than to drop Manhattan or Brooklyn, NY.
Depending on the severity of information loss, merging operations occurred in one of two ways. If the data to be lost did not
seem of high importance (e.g., low population size), an intersection merge was performed to preserve complete data and
avoid the cropping up of missing values. If the counties were more important, it was preferable to preserve some information
for these counties with a left join merge, retaining at least most of the data on such counties. This is admittedly a qualitative and subjective process, but our criteria are spelled out in all cases so the reader can see the choices we made.
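The snippet below is a minimal sketch of the two merge strategies described above; the data frame names (votes16, census10) and key columns (fips, state, county) are illustrative assumptions, not the exact objects used in this report.
## Seamless merge where both sources carry FIPS codes (hypothetical objects)
merged <- merge(votes16, census10, by = "fips")
## Where only county/state names are available, normalize the strings first,
## then choose between an intersection merge and a left join
votes16$county <- tolower(gsub(" county$", "", votes16$county))
census10$county <- tolower(gsub(" county$", "", census10$county))
mergedDrop <- merge(votes16, census10, by = c("state", "county")) ## intersection: drops non-matching counties
mergedKeep <- merge(votes16, census10, by = c("state", "county"), all.x = TRUE) ## left join: keeps them with NAs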
Ultimately, we were able to obtain nearly complete data for 3,102 counties in the United States. This will enable us to obtain
robust, meaningful conclusions in our analysis.
Addressing heterogeneity of units
One potential issue that arises in the use of a K nearest neighbor predictor is the heterogeneity of units across the different columns of data. The features span several orders of magnitude, from the very large scaled integers used for the geographic information (latitude and longitude) to percentages between 0 and 100 for the census data. Counties which have similar demographic information will not be considered nearest neighbors with these units, as the geographic distance is by far the largest contributor to a calculation involving Euclidean distances.
To address this concern, we normalize the features of the data by converting to standard units. By doing so, all features are
weighed more or less equally in terms of determining the nearest neighbors of a county.
Note that this issue is not as relevant for classification trees, which create bifurcating decisions based on numeric thresholds
for quantitative data. These thresholds are largely independent of scale.
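As a quick illustration of that point (using simulated toy data, not the election data), rescaling a predictor by a large constant changes the printed split threshold of an rpart tree but not its structure or fitted classes:
library(rpart)
set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$y <- factor(toy$x + rnorm(200, sd = 0.5) > 0)
fitRaw <- rpart(y ~ x, data = toy, method = "class")
fitBig <- rpart(y ~ I(1e6 * x), data = toy, method = "class")
## The split points differ by the factor 1e6, but the fitted classes agree
all.equal(predict(fitRaw, type = "class"), predict(fitBig, type = "class"))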
## Create copy
elecDataNorm <- elecData
## Convert to standard units
elecDataNorm[,c(3:49)] <- apply(elecData[,c(3:49)],2,scale)
Election Result Visualization
Credit: Brian Thorsen, edited by Ryan Haeri
library(ggplot2)
library(ggmap)
library(maps)
The first step in understanding the 2016 election results is to visually examine how the country voted. Three different maps
plotting the election results in different ways appear below, as each map conveys different but equally important information
about the election.
To plot election results, we sought an alternative to the standard approach of “painting” each geographic region according to its voting patterns. That approach misrepresents the geographic clustering of the US population; for instance, sparsely populated Montana reads visually as more important than the densely populated Bay Area. To address this, we superimpose circles to convey popular vote information, where each circle’s area is proportional to the number of votes it represents.
The first plot we will look at provides insight into how the election was decided at the state level, by plotting the marginal vote count in favor of each candidate for each county.
## Filter out Hawaii for plotting purposes
elecDataNoHI <- elecData[elecData$state != "HI",]
## Split data by outcome to enable differentiated coloration
demWon16 <- elecDataNoHI[elecDataNoHI$dem16 > elecDataNoHI$gop16,]
gopWon16 <- elecDataNoHI[elecDataNoHI$gop16 > elecDataNoHI$dem16,]
## Little bit of info: hexadecimal color codes for the party colors
## come from examining the logo found on each candidate's website
## (code determined with a color selector app).
ggplot(data = demWon16,
mapping = aes(x = latitude / (10 ^ 6), y = longitude / (10 ^ 6))) + ## Coordinate rescaling needed
## Provided gray base US map to be layered onto
borders("county", fill = "gray82", colour = "gray90") +
borders("state", colour = "white", size = 0.85) +
geom_point(aes(size = dem16 - gop16), alpha = 0.72, color = "#0057b8") +
geom_point(data = gopWon16,
mapping = aes(x = latitude / (10 ^ 6), y = longitude / (10 ^ 6),
size = gop16 - dem16), alpha = 0.72, color = "#bc3523") +
## Set title, legend, etc.
scale_size(range = c(0, 7.5),
breaks = c(100, 1000, 10000, 100000, 1000000),
labels = c("100", "1,000", "10,000", "100,000", "1,000,000"),
name = "County Vote Margin",
guide = guide_legend(title.position = "top",
override.aes = list(color = "gray30"))) +
ggtitle("2016 U.S. Presidential Election: Popular Vote Margins") +
## Eliminate unwanted plot elements, move legend
theme(axis.line=element_blank(),axis.text.x=element_blank(),
axis.text.y=element_blank(),axis.ticks=element_blank(),
axis.title.x=element_blank(),axis.title.y=element_blank(),
panel.background=element_blank(),panel.border=element_blank(),
panel.grid.major=element_blank(),panel.grid.minor=element_blank(),
plot.background=element_blank(),
legend.position = "bottom", legend.box = "horizontal")
Examining an individual state on this map shows how close the race was at the state level, and at a glance indicates which candidate won each state (except for states where some counties favored Clinton and others favored Trump). However, the marginal vote gain for each candidate obscures how close a race might have been in a given region; a populous county with an extremely close vote appears as only a small circle on the map.
The following map examines the total vote at the county level, with each county colored according to how large the margin
was in favor of each candidate.
ggplot(data = elecDataNoHI,
mapping = aes(x = latitude / (10 ^ 6), y = longitude / (10 ^ 6))) +
borders("county", fill = "gray82", colour = "gray90") +
borders("state", colour = "white", size = 0.85) +
geom_point(aes(x = latitude / (10 ^ 6), y = longitude / (10 ^ 6),
size = gop16 + dem16,
color = gop16 / (gop16 + dem16)), alpha = 0.95) +
scale_size(range = c(0, 15),
breaks = c(200, 2000, 20000, 200000, 2000000),
labels = c("200", "2,000", "20,000", "200,000", "2,000,000"),
name = "Total County Vote",
guide = guide_legend(title.position = "top",
override.aes = list(color = "gray30"))) +
scale_color_gradient2(low = "#0057b8", mid = "lightyellow", high = "#bc3523",
midpoint = 0.5,
name = "% Trump Vote",
guide = guide_colorbar(title.position = "top")) +
ggtitle("2016 U.S. Presidential Election: Popular Vote Count") +
theme(axis.line=element_blank(),axis.text.x=element_blank(),
axis.text.y=element_blank(),axis.ticks=element_blank(),
axis.title.x=element_blank(),axis.title.y=element_blank(),
panel.background=element_blank(),panel.border=element_blank(),
panel.grid.major=element_blank(),panel.grid.minor=element_blank(),
plot.background=element_blank(),
legend.position = "bottom", legend.box = "horizontal")
This map provides a better indication of how close the race was in the swing states, particularly in Florida. It also shows that some larger cities in the Midwest actually had much closer races than the vote margin alone might indicate (this also explains the pockets of blue in intensely conservative areas of Texas, for instance). For counties in which the race was not close, such as Los Angeles and Alameda County, this information is preserved between the two maps; that said, these are not the counties we are particularly interested in.
The final map highlights the counties which flipped between 2012 and 2016 in favor of the other party. These counties are
the ones of interest for the sake of predicting the change in voting patterns from 2012 to 2016.
## Create individual results about staying/flipping to allow differentiated coloration
demFlip <- elecDataNoHI[elecDataNoHI$dem16 > elecDataNoHI$gop16 &
elecDataNoHI$dem12 < elecDataNoHI$gop12 , ]
gopFlip <- elecDataNoHI[elecDataNoHI$dem16 < elecDataNoHI$gop16 &
elecDataNoHI$dem12 > elecDataNoHI$gop12 , ]
demStay <- elecDataNoHI[elecDataNoHI$dem16 > elecDataNoHI$gop16 &
elecDataNoHI$dem12 > elecDataNoHI$gop12 , ]
gopStay <- elecDataNoHI[elecDataNoHI$dem16 < elecDataNoHI$gop16 &
elecDataNoHI$dem12 < elecDataNoHI$gop12 , ]
ggplot(data = elecDataNoHI) +
borders("county", fill = "gray82", colour = "gray90") +
borders("state", colour = "white", size = 0.85) +
## Plot each case with a different color
geom_point(data = demFlip,
mapping = aes(x = latitude / (10 ^ 6), y = longitude / (10 ^ 6),
size = dem16 - gop16), alpha = 0.95, color = "#0057b8") +
geom_point(data = gopFlip,
mapping = aes(x = latitude / (10 ^ 6), y = longitude / (10 ^ 6),
size = gop16 - dem16), alpha = 0.95, color = "#bc3523") +
geom_point(data = demStay,
mapping = aes(x = latitude / (10 ^ 6), y = longitude / (10 ^ 6),
size = dem16 - gop16), alpha = 0.55, color = "lightskyblue") +
geom_point(data = gopStay,
mapping = aes(x = latitude / (10 ^ 6), y = longitude / (10 ^ 6),
size = gop16 - dem16), alpha = 0.55, color = "lightcoral") +
scale_size(range = c(0, 15),
breaks = c(100, 1000, 10000, 100000, 1000000),
labels = c("100", "1,000", "10,000", "100,000", "1,000,000"),
name = "County Vote Margin",
guide = guide_legend(title.position = "top",
override.aes = list(color = "gray30"))) +
ggtitle("Flipped Counties") +
theme(axis.line=element_blank(),axis.text.x=element_blank(),
axis.text.y=element_blank(),axis.ticks=element_blank(),
axis.title.x=element_blank(),axis.title.y=element_blank(),
panel.background=element_blank(),panel.border=element_blank(),
panel.grid.major=element_blank(),panel.grid.minor=element_blank(),
plot.background=element_blank(),
legend.position = "bottom", legend.box = "horizontal")
It is clear from this map that the large number of counties which flipped to the GOP in the Midwest are of special importance. Most forecasts prior to the election incorrectly called Wisconsin and Michigan for Clinton, and these two states made the largest difference in deciding Trump’s victory. Additionally, few counties flipped in favor of Clinton; Orange County, CA is the most prominent example of such a flip, and it occurred in a solidly blue state.
One possible concern regarding the predictors is whether or not sufficient data points are available to predict a Democratic
flip. For a nearest neighbor analysis, it is quite possible that a K value around 10 might prohibit any possibility of predicting a
Democratic flip (if there are never enough similar counties that flipped this way to cast the majority of “votes” for a test
county). As we see later in the results, this issue is fully realized; the ideal K value for classification success rates, K = 8,
leads to zero counties being predicted as flipping Democratic.
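A toy illustration of that concern (the neighbor labels below are invented, not drawn from the data): even when several of a test county's K = 8 nearest neighbors flipped Democratic, the majority vote still cannot return that label as long as such counties remain a minority among the neighbors.
## Hypothetical labels of the 8 nearest training counties for one test county
neighbors <- c(rep("NO SWITCH", 5), rep("SWITCH TO DEM", 3))
names(which.max(table(neighbors))) ## majority label: "NO SWITCH"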
2016 Election Result Predictor
Credit: Ryan Haeri, minor editing by Brian Thorsen
In this report, our aim was to build a classification tree from the 2016 election results that predicts how counties voted relative to counties with similar features. We define “similar counties” as counties that display relatively similar demographics, population by race and age, labor force size, unemployment, family characteristics, workforce composition by race and age, location, income, and so on. Our goal was to build a predictor based on these features in order to predict how counties with similar features voted.
The 2016 election differed from all recent elections in that many counties which traditionally voted Democratic voted Republican. Because results from past elections are among the features, the classification tree also reflects how 2016 diverged from those elections for counties with certain characteristics.
load("elecDataCorrected.rda")
library(ggplot2)
library(rpart)
library(rpart.plot)
In order to build our predictor, we set aside a test set that was held out during the predictor creation step. We used two-thirds of our data to build a predictor from the features in our data frame, while the test set served as data representing unseen election outcomes.
First, we set a random seed and take a sample of one-third of the row indices without replacement, so that no county is repeated. We then created a new column containing the winning party in each county, which is what we test our predictor against at the end.
To make the test and training sets, we subsetted the random sample of indices from the original election data frame, obtaining a random two-thirds of the rows for the training set and the remaining one-third for the test set, and excluding the columns we will not be examining (county and state information, along with the Democratic and Republican vote counts). We exclude these columns because they are intrinsically related to the class we are predicting.
set.seed(12341234)
numRow = nrow(elecData)
dataPartition = sample(numRow, size = (1/3)*numRow, replace = FALSE)
elecData$Win = factor(((elecData$dem16 > elecData$gop16)+0),
levels = c(0,1), labels = c("gop", "dem"))
elecTest = elecData[dataPartition, -c(1:4)]
elecTrain = elecData[-dataPartition, -c(1:4)]
In order to perform cross-validation, we had to partition our data into non-overlapping folds. The folds must be non-overlapping because if the same county appeared both in the data used to fit a tree and in the data used to assess it, the predictor would appear abnormally accurate. Therefore, we permuted the row indices and arranged them into a matrix with 11 columns, one fold per column.
set.seed(12341234)
nrowTrain = nrow(elecTrain)
permutation = sample(nrowTrain, replace = FALSE)
folds = matrix(permutation, ncol = 11)
We wanted to build a predictor from 10/11 of elecTrain and predict the election results for the remaining 1/11 of the data. We repeated this for each candidate complexity parameter, so that we could select a tuning parameter based on which one minimized our misclassification rate.
We used a double for loop to make our predictions over the folds and the complexity parameters. When building each predictor, we hid the column that contained the election results so that we could later compare the predictor's output to the true values.
We then aligned our predictions with the true values to see how accurately each tree predicted the election results. Each column of the prediction matrix represents a different complexity parameter, and we applied a function to each column that provides the classification rate.
cps = c(seq(0.001, 0.01, by = 0.003),
seq(0.01, 0.1, by = 0.03))
predictionMatrix = matrix(nrow = nrowTrain, ncol = length(cps))
for (i in 1:11) {
testFolds = folds[ ,i]
trainFolds = as.integer(folds[, -i])
for (j in 1:length(cps)) {
tree = rpart(Win ~ .,
data = elecTrain[trainFolds, ],
method = "class",
control = rpart.control(cp = cps[j]))
predictionMatrix[testFolds, j] =
predict(tree,
newdata = elecTrain[testFolds, c(-53)],
type = "class")
}
}
elecResultRates = apply(predictionMatrix, 2, function(oneSet) {
mean(oneSet==as.numeric(elecTrain$Win))
})
Next, we built a graph that displays the classification rate for each complexity parameter. We chose a complexity parameter slightly larger than the one that produced the smallest error (highest classification rate).
It is important to recognize how the classification rate increases as the complexity parameter grows from very small values, then begins to drop. At extremely small complexity parameters the tree fits too much noise, which reduces the classification rate; the rate peaks where the tree balances noise against retaining enough features to classify well. As the complexity parameter increases beyond that point, fewer features are used in predicting the election result, which steadily lowers the classification rate.
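As a small convenience (not a step the analysis depends on), the candidate value that maximizes the cross-validated classification rate can be read off directly from the vectors computed above:
## Complexity parameter with the highest cross-validated classification rate
cps[which.max(elecResultRates)]
max(elecResultRates)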
crosValResults = data.frame(cps, elecResultRates)
ggplot(data = crosValResults, aes(x = cps, y = elecResultRates)) + geom_line() +
  labs(x = "Complexity Parameter", y = "Classification Rate") +
  ggtitle("Complexity Parameter Selection")
Final Classification Tree Creation
Lastly, we built a predictor with a complexity parameter of 0.01 on the elecTrain data set, which allowed us to predict the election results for counties with similar features and calculate the classification rate on elecTest. This complexity parameter was chosen because it keeps the error on the training set low while remaining less susceptible to noise unique to the training data.
cpChoice = 0.01
finalTree = rpart(Win ~ .,
data = elecTrain,
method = "class",
control = rpart.control(cp = cpChoice))
testPreds = predict(finalTree,
newdata = elecTest,
type = "class")
classRate = sum(testPreds == elecTest$Win) /
nrow(elecTest)
print(paste("Classification rate:", classRate))
## [1] "Classification rate: 0.90715667311412"
prp(finalTree)
This classification tree presents a lot of information regarding family/household composition. The diagram shows how past voting trends, along with family/household composition, were useful in predicting the 2016 election results. Based on the data we have, family composition was therefore an important predictor in the 2016 election.
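One way to check that reading against the fitted object itself is to inspect the variable importance scores that rpart stores with the tree; this is just a sketch, and the exact ranking depends on the fitted model.
## Variables ranked by the importance rpart assigns them in the final tree
round(finalTree$variable.importance, 1)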
Predicting Changes from 2012 to 2016
Credit: Kevin Marroquin and Ryan Saraie
In this step of the analysis, we build a predictor for the change from 2012 to 2016. The ultimate goal is to predict how a county's voting pattern will change by analyzing a chosen set of training data. For example, can we predict from electoral, demographic, and geographic data that a county which once voted for a Democrat will switch to a Republican in 2016? We use the K Nearest Neighbor method to build such a predictor.
To review, the K Nearest Neighbor method predicts the category of each observation in a test set from labeled observations in a training set. Each row of the training set carries a label describing its category. When the KNN classifier runs, each row of the test set is compared to the training rows by computing the distance between their feature values; the K closest training counties (the “nearest neighbors”) are found, and the most common label among those neighbors is assigned to the test county.
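As a small illustration of the distance computation that underlies this method (the two feature vectors below are invented standard-unit values, not rows of our data):
## Two hypothetical counties described by a few standardized features
countyA <- c(unemployment = 0.4, medianIncome = -1.2, popOver16 = 0.1)
countyB <- c(unemployment = 0.6, medianIncome = -0.9, popOver16 = 0.3)
sqrt(sum((countyA - countyB)^2)) ## Euclidean distance used to rank neighbors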
To begin, we take a look at the data frame we established. To make the testing easier, we use the sum function to check that no county had a tied vote. The count of counties with equal Democratic and Republican votes is 0 for both the 2012 and 2016 elections, so we know that there were no ties in any county.
sum(elecData$dem12 == elecData$gop12)
## [1] 0
sum(elecData$dem16 == elecData$gop16)
## [1] 0
## Now we know that in both elections, either the Republican or the Democrat won outright in every county.
We write a function named “determine_switch” to show whether a Republican won in a previously Democratic county, a Democrat won in a previously Republican county, or neither. This function is useful for attaching a label to each county describing the change in election results; with set labels for each county, we have a basis for conducting the KNN test. The way “determine_switch” works is that it compares the 2012 votes (variables y and z) to the 2016 votes (variables w and x), identifying a change in the majority vote. For example, if a Democrat won in 2012 (y > z) but a Republican won in 2016 (w < x), the function identifies this change and returns the label “SWITCH TO REP”.
determine_switch = function (w, x, y, z) {
if ((w < x) & (y > z)) {
return("SWITCH TO REP")
} else if ((w > x) & (y < z)) {
return("SWITCH TO DEM")
} else if ((w < x) & (y < z)) {
return("NO SWITCH")
} else if ((w > x) & (y > z)) {
return("NO SWITCH")
}
}
## Note how the function looks at variables that have larger values to determine majorities. This is the key idea for labeling the data.
To prepare the data for the KNN test, we convert the feature values to standard units; we do this so that no single feature dominates the distance calculation and skews the KNN results too strongly. For example, the geographic data in the merged data frame has some very large values; relative to the rest of the table, the geographic coordinates would otherwise unfairly become the main driver of the KNN results. Note that we keep the raw election results unchanged: the “determine_switch” labels are computed from comparisons of the vote columns, and standardizing each vote column separately could change which value is larger and so produce incorrect labels. We do not want incorrect labeling, so we keep the votes as they are. To convert the chosen columns to standard units, we simply pass the scale function to apply.
## Create copy
elecDataNorm <- elecData
## Convert to standard units
elecDataNorm[ ,c(11:49)] <- apply(elecData[ ,c(11:49)], 2, scale)
## Note that we keep all of the direct voting data the same. We do this to keep the voting statuses the same for the labels we just made, and for the sake of keeping all of the voting data consistent.
After this preparation, we are able to label the data, which we do by passing each of the 2012 and 2016 voting variables to the “mapply” function. The labels are placed in a new column of elecDataNorm, titled “switch”.
elecDataNorm$switch <- mapply(determine_switch,
                              elecDataNorm$dem16, elecDataNorm$gop16,
                              elecDataNorm$dem12, elecDataNorm$gop12)
With all of the data labeled, we are now ready to randomize the rows for the KNN method. We randomize the values by writing a function and applying it to every element of a list in which each element is a column of the data frame. Because we set the same seed before sampling each column, the same permutation is applied to every column, so all of the data for a given county stays aligned in the same row position. We set the seed to 100.
The function we use to randomize the data, “random_numbers”, is only used for this step. After setting the seed to 100, it returns a sample of the column without replacement, so every county appears exactly once in the randomized order and no data is lost. For example, after the shuffle, the first data point in each element of the list still pertains to the same county.
data_list = lapply(as.list(1:dim(elecDataNorm)[2]), function(x) elecDataNorm[,x[1]])
## We're turning the data frame into a list, with each element of the list a column of the data frame. This makes the randomization process simple.
data_list[1:2] <- NULL
## We don't need the state and county data for the rest of this section.
random_numbers <- function(x) {
  set.seed(100)
  ## We set the seed to make the sample function reproducible.
  ## We draw 3102 values without replacement to pull every single value from the column once.
  return(sample(x, 3102, replace = FALSE))
}
random_list <- lapply(data_list, random_numbers)
## Taking every element in the list and randomizing it.
Before we run the KNN test, it is important to remove all of the NA values present in our data frame. The NAs come from the 2004 election data and the “whitePop” variable, so we choose to drop the rows in which they occur. We remove rows instead of columns so that we do not eliminate entire variables of data. As our data is currently in a list, we convert the list to a matrix; with all of our data in a matrix, the KNN test becomes possible to conduct. We use the “simplify2array” function to convert the list to a matrix, and the “complete.cases” function to drop the rows of the matrix that contain NA values. As seen below, the NA counts come from the dem04, gop04, and whitePop variables; these are the values we need to remove.
sum(is.na(elecData$dem04))
## [1] 98
sum(is.na(elecData$gop04))
## [1] 98
sum(is.na(elecData$whitePop))
## [1] 2
## We get rid of NA values by turning "random_list" into a matrix, and then we use complete.cases to drop every row containing an NA value.
random_matrix <- simplify2array(random_list)
## We discovered the simplify2array function from http://grokbase.com/p/r/r-help/12c4sgejgj/r-list-to-matrix
random_matrix <- random_matrix[complete.cases(random_matrix), ]
## We discovered the complete.cases function from http://stackoverflow.com/questions/4862178/remove-rows-with-nas-missing-values-in-data-frame
We can now begin the KNN test. With all of the data prepared, we assign the pieces referenced in the “knn” function call; the “knn” function comes from the “class” package. The training data will be the first 2000 counties of the randomized data, and the test data will be the remaining 1002 counties. We use roughly a 2/3 to 1/3 split between training and test data, keeping the larger share as training data so the classifier has more labeled counties to draw neighbors from. This will not be the final KNN test conducted in this section, so we can use an arbitrarily chosen k value, in this case k = 3, to confirm that the KNN test works. We will discuss the most accurate k value later.
switch <- factor(random_matrix[ ,48])
random_matrix <- random_matrix[, -48]
## Repeating code enables coercion into proper types
switch <- factor(random_matrix[ ,48])
random_matrix <- random_matrix[, -48]
## We save the labels into a separate vector, and remove them from our matrix. We will add the labels when we do the KNN test.
library(class)
## The "class" package contains the knn function.
trainSet <- rbind(random_matrix[1:2000, ])
testSet <- rbind(random_matrix[2001:3002, ])
trainLabels <- switch[1:2000]
testLabels <- switch[2001:3002]
## We use the first 2/3 of the randomized data (2000/3002) as the training data, and the final 1/3 (1002/3002) as the test data.
knn_test <- knn(train = trainSet,
test = testSet,
cl = trainLabels,
k = 3, prob = TRUE)
summary(knn_test)
## NO SWITCH SWITCH TO DEM SWITCH TO REP
## 930 6 66
## With k = 3, the KNN test tells us that the results in most counties stay the same, with a slight increase in Republican victories.
A single run of the “knn” function shows that the test works. The “summary” function tells us that, according to the prediction, most counties will not switch the party that wins the majority vote. A smaller share of the test counties switch to a Republican majority, and a minuscule number switch to a Democratic majority. This gives some evidence that a Republican may win in 2016, based on the change in county data. However, all of these observations rest on the assumption that k = 3. Ideally, we want the k value whose predictions most closely match the actual labels of the test counties. The k value matters because it specifies the number of nearest neighbors used to determine each test county's label. With k = 3, each test county is compared to the three training counties most similar to it in electoral, geographic, and demographic data. While that may already give a reasonable result, higher accuracy makes for a stronger test.
To find the most accurate k value, we analyze the output for multiple k values and plot them. Before doing so, we examine the k = 3 run with a cross table that compares the KNN predictions to the actual labels of the test data, stored previously as “testLabels”. The “CrossTable” function comes from the “gmodels” package, which we previously installed. Directly below are the summary statistics for testLabels and knn_test; below those is the cross table.
summary(as.numeric(testLabels))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.169 1.000 3.000
summary(as.numeric(knn_test))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.138 1.000 3.000
## We compare the summaries from the raw data (testLabels) to the predicted data (knn_test). They look about equal; we can further visualize the differences with a table.
## Load the "gmodels" package
library(gmodels)
CrossTable(testLabels, knn_test, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 1002
##
##
## | knn_test
## testLabels | NO SWITCH | SWITCH TO DEM | SWITCH TO REP | Row Total |
## --------------|---------------|---------------|---------------|---------------|
## NO SWITCH | 890 | 3 | 21 | 914 |
## | 0.974 | 0.003 | 0.023 | 0.912 |
## | 0.957 | 0.500 | 0.318 | |
## | 0.888 | 0.003 | 0.021 | |
## --------------|---------------|---------------|---------------|---------------|
## SWITCH TO DEM | 5 | 2 | 0 | 7 |
## | 0.714 | 0.286 | 0.000 | 0.007 |
## | 0.005 | 0.333 | 0.000 | |
## | 0.005 | 0.002 | 0.000 | |
## --------------|---------------|---------------|---------------|---------------|
## SWITCH TO REP | 35 | 1 | 45 | 81 |
## | 0.432 | 0.012 | 0.556 | 0.081 |
## | 0.038 | 0.167 | 0.682 | |
## | 0.035 | 0.001 | 0.045 | |
## --------------|---------------|---------------|---------------|---------------|
## Column Total | 930 | 6 | 66 | 1002 |
## | 0.928 | 0.006 | 0.066 | |
## --------------|---------------|---------------|---------------|---------------|
##
##
## We learned how to use cross tables from https://www.analyticsvidhya.com/blog/2015/08/learning-concept-knn-algorithms-programming/
The cross table shows the number of counties in each combination of predicted and actual category. From this table, we see that we need to count how many KNN predictions are equal to the corresponding test labels, and use the percentage of matching values as an indicator of accuracy. As an example, if all of the raw test data had exactly the same “switch” status as the KNN output, then the KNN test would be 100% accurate.
We measure the accuracy of the KNN classifier for a specific k value by counting how many labels match between knn_test and the actual test labels (“testLabels”). To easily obtain the accuracy for an input k, we write a function named “accuracy_percentage” that carries out the following steps:
Set the seed to 100.
Run the KNN test and store its predictions in a vector named “result”.
Compute the accuracy of those predictions and store it in a vector named “accuracy”.
Return the accuracy value.
(sum(knn_test == testLabels)/1000) * 100
## [1] 93.7
## This approximates the accuracy percentage (the test set has 1002 rows, so dividing by 1000 slightly overstates it).
## The remaining percentage represents the share of values in knn_test that do not match testLabels.
accuracy_percentage <- function(x) {
  set.seed(100)
  ## Setting the seed makes the data for this specific function reproducible.
  result = knn(trainSet, testSet, trainLabels, k = x, prob = TRUE)
  accuracy = ((sum(result == testLabels)/1000) * 100)
  return(accuracy)
}
## The "accuracy_percentage" function plainly gives the percentage of matching values for any given k value.
Now that we have a function that gives us the accuracy percentage for each k, we can plot different k values and find the most accurate one.
The method for doing this is fairly simple. We make a line plot of the accuracy for each k value from 1 to 30. From what we can determine in the graph, the most accurate k value is 8.
k_value <- c(1:30)
accuracy_values <- sapply(c(1:30), accuracy_percentage)
plot_values <- data.frame(k_value, accuracy_values)
## We make a data frame for the values we want to plot.
library(ggplot2)
knn_plot =
ggplot(data = plot_values,
aes(x = k_value, y = accuracy_values)) +
geom_point(stat = 'identity') +
geom_line(stat = 'identity') +
scale_x_continuous(name = "k value") +
scale_y_continuous(name = "Accuracy Percentage") +
labs(title = "Accuracy Levels of k values")
knn_plot
Now that we have a plot of the accuracy percentages, we note that the most accurate level is at k = 8. This means the raw test labels agree most often with the KNN predictions when each county in the test set is compared to its 8 closest neighbors in the training data. We run the KNN test for k = 8. Based on the data, it is most accurate to expect that around 953 counties will maintain their 2012 party majority in election votes, while around 0 counties will switch to a Democratic majority and 49 will switch to a Republican majority. The accuracy percentage at k = 8 is a little over 94%. Ultimately, the model supports the prediction that the 2016 Republican candidate had a higher chance of winning the presidency than the Democrat.
accuracy_percentage(8)
## [1] 94.2
## Accuracy is a little over 94%.
set.seed(100)
knn_test <- knn(trainSet, testSet, trainLabels, k = 8, prob = TRUE)
summary(knn_test)
## NO SWITCH SWITCH TO DEM SWITCH TO REP
## 953 0 49
CrossTable(testLabels, knn_test, prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 1002
##
##
## | knn_test
## testLabels | NO SWITCH | SWITCH TO REP | Row Total |
## --------------|---------------|---------------|---------------|
## NO SWITCH | 904 | 10 | 914 |
## | 0.989 | 0.011 | 0.912 |
## | 0.949 | 0.204 | |
## | 0.902 | 0.010 | |
## --------------|---------------|---------------|---------------|
## SWITCH TO DEM | 6 | 1 | 7 |
## | 0.857 | 0.143 | 0.007 |
## | 0.006 | 0.020 | |
## | 0.006 | 0.001 | |
## --------------|---------------|---------------|---------------|
## SWITCH TO REP | 43 | 38 | 81 |
## | 0.531 | 0.469 | 0.081 |
## | 0.045 | 0.776 | |
## | 0.043 | 0.038 | |
## --------------|---------------|---------------|---------------|
## Column Total | 953 | 49 | 1002 |
## | 0.951 | 0.049 | |
## --------------|---------------|---------------|---------------|
##
##
## We predict 953 counties of the test data will not switch, 0 switch to a Democrat majority, and 49 switch to a Republican majority.
## The prediction suggests that Trump has a higher chance of winning the election.
Results and Discussion
Credit: Kean Amidi-Abraham and Kevin Marroquin
The 2016 election was a surprise to many in the data science community. The GOP victory contradicted the most popular and widely trusted forecasts, backed by a variety of credible publications. The purpose of this project was to analyze the results of the last four presidential elections and use prediction methods to determine the outcome of the 2016 election and to identify what factors might have led to that outcome.
The first plot represents the vote margin in each county throughout the continental United States. The plot is color coded by Democratic or Republican victory, and the size of the circle at the center of each county is scaled to the vote margin (the difference in votes between the competing parties). The greatest vote margins are located in the populous urban centers throughout the country, which overwhelmingly voted Democratic. The Republican victories mostly have small margins and are located in less populous areas of the country.
The second plot shows the total vote counts for each county in the continental United States, color coded by the winning party and scaled to the number of votes. The white, seemingly blurry circles represent counties that are split or only slightly leaning toward a certain candidate. This plot reveals the close vote margins much more clearly than the first plot, which represents close margins only as smaller circles.
Florida, Michigan, Wisconsin, and the Carolinas contain many white or slightly red spots, which makes sense because these are all swing states. Despite these narrow margins, the surrounding areas went mostly Republican.
The third plot, and arguably the most important, highlights the counties that flipped political parties between the 2012 and 2016 elections. The darker colored circles represent these counties, while the more transparent circles represent counties that voted for the same party. Based on the vote margin scale, the counties that flipped have relatively small margins: all less than 100,000. Glancing at the map, one notices that the majority of flipped counties are red, with only a few blue ones scattered throughout the west. Most of the counties that flipped Republican are located in Iowa, Wisconsin, New York, Maine, and other states throughout the Midwest and Northeast. These counties are ultimately what contributed to a Republican victory in 2016.
The 2016 predictor was built using a classification tree with 11-fold cross-validation. The complexity parameter for the classification tree was varied from 0.001 to 0.1, and the value 0.01 was selected because it approximately maximized the cross-validated classification rate. The plot of the classification tree shows the variables deemed important by the classification process. These include factors such as latitude, the percentage of women over 16 in the workforce, votes in 2012, and others. Looking at a random county that voted Republican:
# Republican county is picked using a random number generator
rand_idx = sample(dim(gopWon16)[1], 1)
gopWon16[rand_idx, c("workAtHome", "famSingleMom", "latitude", "popOver16")]
## workAtHome famSingleMom latitude popOver16
## 2686 72 11.4 -82299835 20
The county's values for the variables used by the tree appear consistent with its Republican vote. We can also measure the accuracy of the classification tree along one of its branches.
# Variables of importance in elecData
finalTree$frame$var
## [1] workAtHome famSingleMom latitude
## [4] hhMarried <leaf> dem12
## [7] <leaf> <leaf> popOver16
## [10] <leaf> <leaf> popMarriedSeparated
## [13] dem04 <leaf> <leaf>
## [16] <leaf> workAtHome hhFamily
## [19] <leaf> <leaf> <leaf>
## 10 Levels: <leaf> dem04 dem12 famSingleMom hhFamily hhMarried ... workAtHome
# Following "yes" (to the left)
branch1 = elecData[elecData$workAtHome < 1104,]
branch2 = branch1[branch1$famSingleMom < 19,]
branch3 = branch2[branch2$latitude < -74e+6,]
branch4 = branch3[branch3$hhMarried >= 41,]
# Percentage of GOP win
length(which(branch4$Win == "gop"))/dim(branch4)[1]
## [1] 0.9576304
Following this branch of the tree, 95.8% of the counties that satisfy these splits voted Republican, so the tree is roughly 95.8% correct along this path.
The nearest neighbor predictor is a little over 94% accurate when using a k value of 8, which is comparable to the classification tree. As a final check, we follow the same branch of the tree restricted to the counties of Florida:
# Predicting Florida
florida = elecData[elecData$state == "FL",]
branch1 = florida[florida$workAtHome < 1104,]
branch2 = branch1[branch1$famSingleMom < 19,]
branch3 = branch2[branch2$latitude < -74e+6,]
branch4 = branch3[branch3$hhMarried >= 41,]
# Percentage of GOP win
length(which(branch4$Win == "gop"))/dim(branch4)[1]
## [1] 1
References and Acknowledgements
Code packages used:
R Core Team (2016). R: A language and environment for statistical computing. R Foundation for Statistical
Computing, Vienna, Austria. URL https://www.R-project.org/.
H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2009.
(http://docs.ggplot2.org/current/)
D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161. URL
http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf
Original S code by Richard A. Becker, Allan R. Wilks. R version by Ray Brownrigg. Enhancements by Thomas P Minka
and Alex Deckmyn. (2016). maps: Draw Geographical Maps. R package version 3.1.1. https://CRAN.R-project.org/package=maps
Gregory R. Warnes, Ben Bolker, Thomas Lumley, Randall C Johnson. Contributions from Randall C. Johnson are
Copyright SAIC-Frederick, Inc. Funded by the Intramural Research Program, of the NIH, National Cancer Institute and
Center for Cancer Research under NCI Contract NO1-CO-12400. (2015). gmodels: Various R Programming Tools for
Model Fitting. R package version 2.16.2. https://CRAN.R-project.org/package=gmodels
Terry Therneau, Beth Atkinson and Brian Ripley (2015). rpart: Recursive Partitioning and Regression Trees. R
package version 4.1-10. https://CRAN.R-project.org/package=rpart
Data Sources:
Data hosting courtesy of Professor Deborah Nolan, University of California Berkeley, Department of Statistics
2016 results originally available from https://github.com/tonmcg/County_Level_Election_Results_12-16/blob/master/2016_US_County_Level_Presidential_Results.csv
2012 results originally available from Politico: http://www.politico.com/2012-election/map/#/President/2012/
2008 results originally available from The Guardian: https://www.theguardian.com/news/datablog/2009/mar/02/us-elections-2008
2004 results available from Deborah Nolan (UC Berkeley):
http://www.stat.berkeley.edu/users/nolan/data/voteProject/countyVotes2004.txt
2010 Census data available from http://factfinder2.census.gov/faces/nav/jsf/pages/searchresults.xhtml?refresh=t
Geographic information available from Deborah Nolan:
http://www.stat.berkeley.edu/users/nolan/data/voteProject/counties.gml
Data Description:
http://stats.stackexchange.com/questions/121886/when-should-i-apply-feature-scaling-for-my-data
http://stats.stackexchange.com/questions/10149/converting-a-vector-to-a-vector-of-standard-units-in-r
Map Visualization:
http://stackoverflow.com/questions/6528180/ggplot2-plot-without-axes-legends-etc
http://docs.ggplot2.org/current/
Predicting 2016 Results:
Recursive partition tree code based on code provided by Professor Deborah Nolan, UC Berkeley Dept. of Statistics
Predicting Changes From 2012 to 2016:
We discovered the simplify2array function from http://grokbase.com/p/r/r-help/12c4sgejgj/r-list-to-matrix
We discovered the complete.cases function from http://stackoverflow.com/questions/4862178/remove-rows-with-nas-missing-values-in-data-frame
We learned how to use cross tables from https://www.analyticsvidhya.com/blog/2015/08/learning-concept-knn-algorithms-programming/
More Related Content

Viewers also liked

Yazmin Kuball 2016 Resume
Yazmin Kuball 2016 ResumeYazmin Kuball 2016 Resume
Yazmin Kuball 2016 ResumeYazmin Kuball
 
Mule dataweave
Mule dataweaveMule dataweave
Mule dataweaveSon Nguyen
 
Suspensão de divulgação pesquisa eleitoral
Suspensão de divulgação pesquisa eleitoralSuspensão de divulgação pesquisa eleitoral
Suspensão de divulgação pesquisa eleitoralAkibas De Freitas Souza
 
Adobe Digital Economy Project – October 2016
Adobe Digital Economy Project – October 2016Adobe Digital Economy Project – October 2016
Adobe Digital Economy Project – October 2016Adobe
 
Alerte a la pollution aux particules fines PM10 pour les Hautes Pyrenees pour...
Alerte a la pollution aux particules fines PM10 pour les Hautes Pyrenees pour...Alerte a la pollution aux particules fines PM10 pour les Hautes Pyrenees pour...
Alerte a la pollution aux particules fines PM10 pour les Hautes Pyrenees pour...Philippe Villette
 
La actividad financiera del estado venezolano
La actividad financiera del estado venezolanoLa actividad financiera del estado venezolano
La actividad financiera del estado venezolanoMelvismar Garcia
 

Viewers also liked (11)

Yazmin Kuball 2016 Resume
Yazmin Kuball 2016 ResumeYazmin Kuball 2016 Resume
Yazmin Kuball 2016 Resume
 
niki1 (2)
niki1 (2)niki1 (2)
niki1 (2)
 
완성23
완성23완성23
완성23
 
Concursal.
Concursal.Concursal.
Concursal.
 
Mule dataweave
Mule dataweaveMule dataweave
Mule dataweave
 
Suspensão de divulgação pesquisa eleitoral
Suspensão de divulgação pesquisa eleitoralSuspensão de divulgação pesquisa eleitoral
Suspensão de divulgação pesquisa eleitoral
 
Adobe Digital Economy Project – October 2016
Adobe Digital Economy Project – October 2016Adobe Digital Economy Project – October 2016
Adobe Digital Economy Project – October 2016
 
design principles
design principlesdesign principles
design principles
 
Alerte a la pollution aux particules fines PM10 pour les Hautes Pyrenees pour...
Alerte a la pollution aux particules fines PM10 pour les Hautes Pyrenees pour...Alerte a la pollution aux particules fines PM10 pour les Hautes Pyrenees pour...
Alerte a la pollution aux particules fines PM10 pour les Hautes Pyrenees pour...
 
Edital Completo
Edital CompletoEdital Completo
Edital Completo
 
La actividad financiera del estado venezolano
La actividad financiera del estado venezolanoLa actividad financiera del estado venezolano
La actividad financiera del estado venezolano
 

Similar to Visualizing Factors that Influenced the 2016 US Presidential Election

Election Project (Elep)
Election Project (Elep)Election Project (Elep)
Election Project (Elep)datamap.io
 
Global SaaS and Web Services Trends - H1 2107
Global SaaS and Web Services Trends - H1 2107Global SaaS and Web Services Trends - H1 2107
Global SaaS and Web Services Trends - H1 2107Arpit Jaiswal
 
2013 GISCO Track, Quality Assessment and Improvement for Addressed Locations ...
2013 GISCO Track, Quality Assessment and Improvement for Addressed Locations ...2013 GISCO Track, Quality Assessment and Improvement for Addressed Locations ...
2013 GISCO Track, Quality Assessment and Improvement for Addressed Locations ...GIS in the Rockies
 
Analysis of us presidential elections, 2016
  Analysis of us presidential elections, 2016  Analysis of us presidential elections, 2016
Analysis of us presidential elections, 2016Tapan Saxena
 
Election Project (ELEP)
Election Project (ELEP)Election Project (ELEP)
Election Project (ELEP)datamap.io
 
Employees, Business Partners and Bad Guys: What Web Data Reveals About Person...
Visualizing Factors that Influenced the 2016 US Presidential Election

Beyond counties missing from individual sources, heterogeneity in county names also prevented seamless merges.

In preparing our data for later analysis, several issues arose which had to be addressed. Among these was the proper handling of data represented in incorrect types, such as numeric data appearing as strings with added characters. Some states were missing from the initial data frames provided: Alaska was not present in the 2012 data, and Virginia and Hawaii were not present in the 2004 data. For the latter two, we were able to obtain the results by parsing data sources available elsewhere on the internet.

A particularly critical recurring issue was the handling of merges across the different data sources. To merge by county and state pairs, it was necessary to obtain such information from every data source, and this information was often represented in different ways.
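As a toy sketch of the kind of mismatch and string manipulation discussed here (the county strings and the cleaning function are hypothetical, not taken from the project data):

a <- "De Witt County"
b <- "DeWitt"
## Hypothetical helper: lowercase, then strip the "county" suffix, spaces, and periods
cleanName <- function(x) gsub("county| |\\.", "", tolower(x))
cleanName(a) == cleanName(b)   # TRUE once spacing and the suffix are stripped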
In one instance, the use of FIPS codes enabled seamless merging of the data, whereas other merges required string manipulations to facilitate the matching of values. Additionally, an issue arose with missing values and with values that had no directly translatable equivalent in the data frame to be merged. To address these concerns, we examined which values would be lost in a merge along the intersection of both data frames, and assessed the possibility of manually reassigning names. For counties in which manual reassignment was not a simple task, the next step was to determine, qualitatively speaking, how important those counties were. For instance, it was a much easier decision to drop the least populous counties in New Mexico from the results than to drop Manhattan or Brooklyn, NY.

Depending on the severity of information loss, merging operations occurred in one of two ways. If the data to be lost did not seem of high importance (e.g., low population size), an intersection merge was performed to preserve complete data and avoid introducing missing values. If the counties were more important, a left join was preferable, preserving at least some information for those counties. This is admittedly a qualitative and subjective process, but our criteria are spelled out in all cases so the reader can follow the choices we made. Ultimately, we were able to obtain nearly complete data for 3,102 counties in the United States, which enables us to draw robust, meaningful conclusions in our analysis.

Addressing heterogeneity of units

One potential issue that arises in the use of a K nearest neighbor predictor is the heterogeneity of units across the different columns of data. For instance, the geographic coordinates (latitude and longitude) are stored as values on the order of tens of millions (they are divided by 10^6 for plotting), while the census data takes values between 0 and 100. With these units, counties with similar demographic information will not be considered nearest neighbors, because the geographic distance is by far the largest contributor to a Euclidean distance calculation. To address this concern, we normalize the features of the data by converting them to standard units. By doing so, all features are weighed more or less equally in determining the nearest neighbors of a county. Note that this issue is not as relevant for classification trees, which create bifurcating decisions based on numeric thresholds for quantitative data; these thresholds are largely independent of scale.
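A small illustration of the point above, using made-up numbers for a hypothetical coordinate column and a hypothetical percentage column (neither is a real column from this data set):

a <- c(coord = 44.5e6, pctUnemployed = 5)
b <- c(coord = 44.6e6, pctUnemployed = 25)
ab <- rbind(a, b)
abStd <- scale(ab)
sqrt(sum((ab[1, ] - ab[2, ])^2))        # raw units: the distance is dominated by the coordinate term
sqrt(sum((abStd[1, ] - abStd[2, ])^2))  # standard units: both features contribute comparably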
## Create copy
elecDataNorm <- elecData

## Convert to standard units
elecDataNorm[, c(3:49)] <- apply(elecData[, c(3:49)], 2, scale)

Election Result Visualization

Credit: Brian Thorsen, edited by Ryan Haeri

library(ggplot2)
library(ggmap)
library(maps)

The first step in understanding the 2016 election results is to examine visually how the country voted. Three maps plotting the election results in different ways appear below; each conveys different but equally important information about the election.

To plot election results, we sought an alternative to the standard "painting" of each geographic region according to its voting patterns. That approach misrepresents the geographic clustering of the US population; for instance, sparsely populated Montana reads as more important than the densely populated Bay Area. Instead, we superimpose circles to convey popular vote information, where each circle's area is proportional to the number of votes it represents.

The first plot provides insight into how the election was decided at the state level by plotting the marginal vote count in favor of each candidate for each county.

## Filter out Hawaii for plotting purposes
elecDataNoHI <- elecData[elecData$state != "HI", ]
## Split data by outcome to enable differentiated coloration
demWon16 <- elecDataNoHI[elecDataNoHI$dem16 > elecDataNoHI$gop16, ]
gopWon16 <- elecDataNoHI[elecDataNoHI$gop16 > elecDataNoHI$dem16, ]

## Little bit of info: hexadecimal color codes for the party colors
## come from examining the logo found on each candidate's website
## (code determined with a color selector app).
ggplot(data = demWon16,
       mapping = aes(x = latitude / (10 ^ 6), y = longitude / (10 ^ 6))) +  ## Coordinate rescaling needed
  ## Provided gray base US map to be layered onto
  borders("county", fill = "gray82", colour = "gray90") +
  borders("state", colour = "white", size = 0.85) +
  geom_point(aes(size = dem16 - gop16), alpha = 0.72, color = "#0057b8") +
  geom_point(data = gopWon16,
             mapping = aes(x = latitude / (10 ^ 6), y = longitude / (10 ^ 6),
                           size = gop16 - dem16),
             alpha = 0.72, color = "#bc3523") +
  ## Set title, legend, etc.
  scale_size(range = c(0, 7.5),
             breaks = c(100, 1000, 10000, 100000, 1000000),
             labels = c("100", "1,000", "10,000", "100,000", "1,000,000"),
             name = "County Vote Margin",
             guide = guide_legend(title.position = "top",
                                  override.aes = list(color = "gray30"))) +
  ggtitle("2016 U.S. Presidential Election: Popular Vote Margins") +
  ## Eliminate unwanted plot elements, move legend
  theme(axis.line = element_blank(), axis.text.x = element_blank(),
        axis.text.y = element_blank(), axis.ticks = element_blank(),
        axis.title.x = element_blank(), axis.title.y = element_blank(),
        panel.background = element_blank(), panel.border = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        plot.background = element_blank(),
        legend.position = "bottom", legend.box = "horizontal")
Examining an individual state on this map provides information about how close the race was at the state level, and at a glance shows which candidate won each state (except for states with some counties favoring Clinton and others favoring Trump). However, the marginal vote gain for each candidate obscures how close a race might have been in
a given region; a populous county with an extremely close vote appears as a small circle on the map. The following map examines the total vote at the county level, with each county colored according to how large the margin was in favor of each candidate.

ggplot(data = elecDataNoHI,
       mapping = aes(x = latitude / (10 ^ 6), y = longitude / (10 ^ 6))) +
  borders("county", fill = "gray82", colour = "gray90") +
  borders("state", colour = "white", size = 0.85) +
  geom_point(aes(x = latitude / (10 ^ 6), y = longitude / (10 ^ 6),
                 size = gop16 + dem16,
                 color = gop16 / (gop16 + dem16)),
             alpha = 0.95) +
  scale_size(range = c(0, 15),
             breaks = c(200, 2000, 20000, 200000, 2000000),
             labels = c("200", "2,000", "20,000", "200,000", "2,000,000"),
             name = "Total County Vote",
             guide = guide_legend(title.position = "top",
                                  override.aes = list(color = "gray30"))) +
  scale_color_gradient2(low = "#0057b8", mid = "lightyellow", high = "#bc3523",
                        midpoint = 0.5, name = "% Trump Vote",
                        guide = guide_colorbar(title.position = "top")) +
  ggtitle("2016 U.S. Presidential Election: Popular Vote Count") +
  theme(axis.line = element_blank(), axis.text.x = element_blank(),
        axis.text.y = element_blank(), axis.ticks = element_blank(),
        axis.title.x = element_blank(), axis.title.y = element_blank(),
        panel.background = element_blank(), panel.border = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        plot.background = element_blank(),
        legend.position = "bottom", legend.box = "horizontal")
This map gives a better indication of how close the race was in the swing states, particularly in Florida. It also shows that some larger cities in the Midwest saw much closer races than the marginal vote might indicate (this also explains the pockets of blue in intensely conservative areas of Texas, for instance). For counties in which the race was not close,
such as Los Angeles and Alameda County, this information is preserved between the two maps; that said, these are not the counties we are particularly interested in.

The final map highlights the counties which flipped between 2012 and 2016 in favor of the other party. These are the counties of interest for predicting the change in voting patterns from 2012 to 2016.

## Create individual results about staying/flipping to allow differentiated coloration
demFlip <- elecDataNoHI[elecDataNoHI$dem16 > elecDataNoHI$gop16 &
                          elecDataNoHI$dem12 < elecDataNoHI$gop12, ]
gopFlip <- elecDataNoHI[elecDataNoHI$dem16 < elecDataNoHI$gop16 &
                          elecDataNoHI$dem12 > elecDataNoHI$gop12, ]
demStay <- elecDataNoHI[elecDataNoHI$dem16 > elecDataNoHI$gop16 &
                          elecDataNoHI$dem12 > elecDataNoHI$gop12, ]
gopStay <- elecDataNoHI[elecDataNoHI$dem16 < elecDataNoHI$gop16 &
                          elecDataNoHI$dem12 < elecDataNoHI$gop12, ]

ggplot(data = elecDataNoHI) +
  borders("county", fill = "gray82", colour = "gray90") +
  borders("state", colour = "white", size = 0.85) +
  ## Plot each case with a different color
  geom_point(data = demFlip,
             mapping = aes(x = latitude / (10 ^ 6), y = longitude / (10 ^ 6),
                           size = dem16 - gop16),
             alpha = 0.95, color = "#0057b8") +
  geom_point(data = gopFlip,
             mapping = aes(x = latitude / (10 ^ 6), y = longitude / (10 ^ 6),
                           size = gop16 - dem16),
             alpha = 0.95, color = "#bc3523") +
  geom_point(data = demStay,
             mapping = aes(x = latitude / (10 ^ 6), y = longitude / (10 ^ 6),
                           size = dem16 - gop16),
             alpha = 0.55, color = "lightskyblue") +
  geom_point(data = gopStay,
             mapping = aes(x = latitude / (10 ^ 6), y = longitude / (10 ^ 6),
                           size = gop16 - dem16),
             alpha = 0.55, color = "lightcoral") +
  scale_size(range = c(0, 15),
             breaks = c(100, 1000, 10000, 100000, 1000000),
             labels = c("100", "1,000", "10,000", "100,000", "1,000,000"),
             name = "County Vote Margin",
             guide = guide_legend(title.position = "top",
                                  override.aes = list(color = "gray30"))) +
  ggtitle("Flipped Counties") +
  theme(axis.line = element_blank(), axis.text.x = element_blank(),
        axis.text.y = element_blank(), axis.ticks = element_blank(),
        axis.title.x = element_blank(), axis.title.y = element_blank(),
        panel.background = element_blank(), panel.border = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        plot.background = element_blank(),
        legend.position = "bottom", legend.box = "horizontal")
It is clear from this map that the large number of counties which flipped to the GOP in the Midwest are of special importance. Most predictors prior to the election incorrectly predicted Wisconsin and Michigan as going to Clinton, and these two states made the largest difference in deciding Trump's victory. Additionally, few counties flipped in favor of Clinton; Orange County,
CA is the most prominent example of such a flip, and it occurred in a solidly blue state.

One concern regarding the predictors is whether sufficient data points are available to predict a Democratic flip. For a nearest neighbor analysis, a K value around 10 might rule out predicting a Democratic flip entirely, if there are never enough similar counties that flipped this way to cast the majority of "votes" for a test county. As we see later in the results, this issue is fully realized: the K value with the best classification success rate, K = 8, leads to zero counties being predicted as flipping Democratic.

2016 Election Result Predictor

Credit: Ryan Haeri, minor editing by Brian Thorsen

Our aim was to build a classification tree from the 2016 election results that predicts how counties voted based on their features. We define "similar counties" as counties with relatively similar demographic features: population by race and age, labor force size, unemployment, family characteristics, location, income, and so on. Our goal was to build a predictor from these features in order to predict how counties with similar features voted. The 2016 election differed from recent elections in that many counties which traditionally voted Democratic voted Republican; the classification tree also draws on past elections to show how 2016 voting differed based on certain county features.

load("elecDataCorrected.rda")
library(ggplot2)
library(rpart)
library(rpart.plot)

In order to build our predictor, we set aside a test set during the predictor creation step. We used two-thirds of our data to build a predictor based on the features in our data frame; the test set's purpose was to
serve as a stand-in for unknown future voter turnouts.

First, we set a random seed and take a sample of one-third of the election data without replacement, so that no county is repeated. We then create a new column containing each county's result, which is what we test our predictor against at the end. To build the test and training sets, we subset a random sample of the row indices of the original election data frame: two-thirds of the rows form the training set and one-third forms the test set, excluding the columns we will not be examining (county and state information, along with Democratic and Republican vote counts). We exclude these columns because they are intrinsically related to the class of our data, containing information directly tied to the outcome we are predicting.

set.seed(12341234)
numRow = nrow(elecData)
dataPartition = sample(numRow, size = (1/3) * numRow, replace = FALSE)
elecData$Win = factor(((elecData$dem16 > elecData$gop16) + 0),
                      levels = c(0, 1), labels = c("gop", "dem"))
elecTest = elecData[dataPartition, -c(1:4)]
elecTrain = elecData[-dataPartition, -c(1:4)]
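As a quick sanity check (not part of the original analysis), the class balance of the training labels can be inspected before fitting any tree:

## Counts and proportions of GOP-won versus Democratic-won counties in the training set
table(elecTrain$Win)
prop.table(table(elecTrain$Win))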
In order to perform cross-validation, we had to partition our data into folds whose intersection is empty. The folds must be non-overlapping because repeated rows would give the predictor abnormally accurate results. We therefore permuted the row indices and arranged them into a matrix with 11 columns, one fold per column.

set.seed(12341234)
nrowTrain = nrow(elecTrain)
permutation = sample(nrowTrain, replace = FALSE)
folds = matrix(permutation, ncol = 11)

We build a predictor on 10/11 of elecTrain and predict the election results for the remaining 1/11, repeating this for each fold. We fit a tree for each candidate complexity parameter, so that we can select the tuning parameter that minimizes the misclassification rate. A double for loop makes predictions over the folds for each complexity parameter. When building the predictor, we hide the column containing the election results so that we can later compare the predictions to the true values, aligning each partition with the corresponding true labels. Each column of the prediction matrix corresponds to a complexity parameter, and we apply a function over the columns to compute the classification rate.

cps = c(seq(0.001, 0.01, by = 0.003), seq(0.01, 0.1, by = 0.03))
predictionMatrix = matrix(nrow = nrowTrain, ncol = length(cps))

for (i in 1:11) {
  testFolds = folds[, i]
  trainFolds = as.integer(folds[, -i])
  for (j in 1:length(cps)) {
    tree = rpart(Win ~ ., data = elecTrain[trainFolds, ], method = "class",
                 control = rpart.control(cp = cps[j]))
    predictionMatrix[testFolds, j] = predict(tree, newdata = elecTrain[testFolds, c(-53)],
                                             type = "class")
  }
}

elecResultRates = apply(predictionMatrix, 2, function(oneSet) {
  mean(oneSet == as.numeric(elecTrain$Win))
})

Next, we plot the classification rate for each complexity parameter and choose a value slightly larger than the parameter with the smallest cross-validation error. Note how the classification rate increases at very small complexity parameter values and then begins to drop. At extremely small complexity parameters the tree fits too much noise, which lowers the classification rate; the rate peaks where there is a balance between noise and enough features to classify well. As the complexity parameter increases further, fewer features are used in predicting the election result, which steadily lowers the classification rate.

crosValResults = data.frame(cps, elecResultRates)
ggplot(data = crosValResults, aes(x = cps, y = elecResultRates)) +
  geom_line() +
  labs(x = "Complexity Parameter", y = "Classification Rate") +
  ggtitle("Complexity Parameter Selection")
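As a hypothetical follow-up not in the original report, the best-scoring complexity parameter can also be pulled out programmatically rather than read off the plot:

## Complexity parameter with the highest cross-validated classification rate
cps[which.max(elecResultRates)]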
Final Classification Tree Creation

Lastly, we built a predictor with a complexity parameter of 0.01 on the elecTrain data set, which allowed us to predict the
election results for counties with similar features and calculate the classification rate on elecTest. This complexity parameter was chosen because it reduces the error on the training set reasonably well while remaining less susceptible to noise unique to the training data.

cpChoice = 0.01
finalTree = rpart(Win ~ ., data = elecTrain, method = "class",
                  control = rpart.control(cp = cpChoice))
testPreds = predict(finalTree, newdata = elecTest, type = "class")
classRate = sum(testPreds == elecTest$Win) / nrow(elecTest)
print(paste("Classification rate:", classRate))

## [1] "Classification rate: 0.90715667311412"

prp(finalTree)
This classification tree presents a great deal of information regarding family and household composition. The diagram lets us observe how past voting trends, together with family and household composition, were useful in predicting the 2016 election results. Based on the data we have, family composition was therefore an important predictor in the 2016 election.
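A brief follow-up sketch, not part of the original report: rpart also records numeric summaries of the fitted tree, which can complement the diagram (the importance vector is only populated when the fit produced splits).

finalTree$variable.importance   # named vector of split contributions, largest first
printcp(finalTree)              # complexity table for the fitted tree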
Predicting Changes from 2012 to 2016

Credit: Kevin Marroquin and Ryan Saraie

In this step of the analysis, we build a predictor for the change from 2012 to 2016. The ultimate goal is to predict how a county's voting pattern will change by analyzing a chosen set of training data. For example, can we predict from electoral, demographic, and geographic data that a county which once voted Democratic will switch to Republican in 2016? We use the K nearest neighbor (KNN) method to build such a predictor.

To review, the K nearest neighbor method predicts the categories of a sample of test data from the trends found in a training data set. Each row of the training set carries a label giving its category. When the KNN test runs, each row in the test set is compared, feature by feature, with the training rows, and its "nearest neighbors" determine its predicted label. To find the nearest neighbors, the distance between feature values is computed against each county in the training data.

To begin, we take a look at the data frame we established. To make the testing easier, we verify that there were no ties in any county using the sum function: the number of counties with equal vote totals in 2012 and in 2016 both read 0, so no county was tied.

sum(elecData$dem12 == elecData$gop12)

## [1] 0

sum(elecData$dem16 == elecData$gop16)

## [1] 0
## Now we know that in both elections, either a Republican or a Democrat won each county outright.

We write a function named determine_switch to report whether a Republican won in a previously Democratic county, a Democrat won in a previously Republican county, or neither. This function assigns each county a label describing the change in election outcome; with these labels in place, we have a basis for conducting the KNN test. determine_switch compares the 2012 votes (arguments y and z) to the 2016 votes (arguments w and x) and identifies a change in the majority. For example, if a Democrat won in 2012 (y > z) but a Republican won in 2016 (w < x), the function identifies this change and returns the label "SWITCH TO REP".

determine_switch = function(w, x, y, z) {
  if ((w < x) & (y > z)) {
    return("SWITCH TO REP")
  } else if ((w > x) & (y < z)) {
    return("SWITCH TO DEM")
  } else if ((w < x) & (y < z)) {
    return("NO SWITCH")
  } else if ((w > x) & (y > z)) {
    return("NO SWITCH")
  }
}

## Note how the function looks at which values are larger to determine majorities.
## This is the key idea for labeling the data.
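A toy usage example with made-up vote counts (not real county totals) shows the labels the function returns:

determine_switch(w = 1200, x = 1500, y = 900, z = 800)   # Dem won 2012, GOP won 2016: "SWITCH TO REP"
determine_switch(w = 1500, x = 1200, y = 800, z = 900)   # GOP won 2012, Dem won 2016: "SWITCH TO DEM"
determine_switch(w = 1500, x = 1200, y = 900, z = 800)   # Dem won both years: "NO SWITCH"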
To keep any one feature from dominating the KNN distance calculation, we convert the data values to standard units. For example, the geographic coordinates in the merged data frame take very large values; relative to the rest of the table, they could unfairly become the main driver of the KNN results. Note that we keep the election vote counts unchanged: when we apply determine_switch to the table, counties would be labeled incorrectly if the votes had been converted to standard units, so we leave the votes as they are. To convert the chosen columns to standard units, we use the apply function with scale.

## Create copy
elecDataNorm <- elecData

## Convert to standard units
elecDataNorm[, c(11:49)] <- apply(elecData[, c(11:49)], 2, scale)

## Note that we keep all of the direct voting data the same. We do this to keep the voting
## statuses consistent with the labels we just made, and to keep the voting data consistent overall.

After this step we can label the data, which we do by passing the 2012 and 2016 voting variables to mapply. The labels are placed in a new column of elecDataNorm, titled "switch".

elecDataNorm$switch <- mapply(determine_switch,
                              elecDataNorm$dem16, elecDataNorm$gop16,
                              elecDataNorm$dem12, elecDataNorm$gop12)

With all of the data categorized, we are ready to randomize the data for the KNN method. We randomize the values by writing a function and applying it to every element of a list, where each element of the list represents a column of the data frame; by resetting the same seed for each column, all of the data stays matched to the same county by row. We set the seed to 100. The randomization function, random_numbers, is used only once: after setting the seed, it returns a sample of the data without replacement, and every county appears in the final randomization, so we do not lose any data. Because the seed is reset immediately before each call to sample, all of the data for a single county stays together; when we put the data into a list, the first data point in each element of the list pertains to the same county.
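A minimal sketch, not from the report, of an equivalent way to view this step: because set.seed(100) is reset before every call to sample(), shuffling each column separately is the same as drawing one permutation and applying it to whole rows.

set.seed(100)
perm <- sample(nrow(elecDataNorm), replace = FALSE)
shuffledRows <- elecDataNorm[perm, ]   # counties stay intact across all columns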
data_list = lapply(as.list(1:dim(elecDataNorm)[2]), function(x) elecDataNorm[, x[1]])
## We turn the data frame into a list in which each element is a column of the data frame.
## This makes the randomization process simple.

data_list[1:2] <- NULL
## We don't need the state and county data for the rest of this section.

random_numbers <- function(x) {
  set.seed(100)
  return(sample(x, 3102, replace = FALSE))
  ## We set the seed to make the sample function reproducible.
  ## We draw 3102 values without replacement to pull every value from the column exactly once.
}

random_list <- lapply(data_list, random_numbers)
## Taking every element in the list and randomizing it.

Before we run the KNN test, it is important to remove all of the NA values present in the data. We check for NAs below. Since NAs exist in the 2004 election data and in the whitePop variable, we remove the rows in which those NAs occur; we remove rows rather than columns so that we do not eliminate entire variables. As our data is currently in a list, we convert the list to a matrix, which is the form the KNN test requires. We use the simplify2array function to convert the list to a matrix, and the complete.cases function to drop rows of the matrix that contain NA values. As seen below, the NA values come from the dem04, gop04, and whitePop variables; these are the values we need to remove.

sum(is.na(elecData$dem04))

## [1] 98
sum(is.na(elecData$gop04))

## [1] 98

sum(is.na(elecData$whitePop))

## [1] 2

## We get rid of NA values by turning random_list into a matrix, and then we use
## complete.cases to eliminate every row containing an NA value.

random_matrix <- simplify2array(random_list)
## We discovered the simplify2array function from http://grokbase.com/p/r/r-help/12c4sgejgj/r-list-to-matrix

random_matrix <- random_matrix[complete.cases(random_matrix), ]
## We discovered the complete.cases function from
## http://stackoverflow.com/questions/4862178/remove-rows-with-nas-missing-values-in-data-frame

We can now begin the KNN test. With the data adequately categorized, we assign the pieces needed by the knn function from the class package. The training data will be the first 2000 counties of the randomized data, and the test data will be the remaining 1002 counties. A roughly 2/3 to 1/3 split between training and test data is an efficient way to keep the predictor as accurate as possible: the larger share of training data gives the model more information to learn from. This is not the final KNN test of this section, so we use an arbitrarily chosen k value, in this case k = 3, simply to confirm that the KNN test works; we discuss the most accurate k value later.
switch <- factor(random_matrix[, 48])
random_matrix <- random_matrix[, -48]

## Repeating code enables coercion into proper types
switch <- factor(random_matrix[, 48])
random_matrix <- random_matrix[, -48]

## We save the labels into a separate vector and remove them from the matrix. We supply the
## labels separately when we run the KNN test.

library(class)
## The "class" package contains the knn function.

trainSet <- rbind(random_matrix[1:2000, ])
testSet <- rbind(random_matrix[2001:3002, ])
trainLabels <- switch[1:2000]
testLabels <- switch[2001:3002]
## We use the first 2/3 of the randomized data (2000/3002) as the training data, and the final
## 1/3 (1002/3002) as the test data.

knn_test <- knn(train = trainSet, test = testSet, cl = trainLabels, k = 3, prob = TRUE)
summary(knn_test)

##     NO SWITCH SWITCH TO DEM SWITCH TO REP
##           930             6            66

## With k = 3, the KNN test tells us that the results in most counties stay the same, with a
## slight tilt toward additional Republican victories.

A single run of the knn function confirms that the test works. The summary tells us that, according to the prediction, most counties will not switch the party that wins their majority vote; a smaller share of test counties switch to a Republican majority, and a minuscule number switch to a Democratic majority. This is evidence that a Republican may win in 2016, based on the change in county data. However, all of these observations rest on the assumption that k = 3. Ideally, we want the k value whose predictions best match the actual labels of the test counties. The k value matters because it sets the number of nearest neighbors used for each test county, and those neighbors determine the county's predicted label. With k = 3, each test county is compared to the three training counties closest to it in electoral, geographic, and demographic data. While that may give a reasonable result, higher accuracy makes for a stronger test.

To find the most accurate k value, we compute the results for multiple k values and plot them. First we visualize the k = 3 run by creating a cross table that compares the KNN predictions to the actual labels of the test data, stored above as testLabels. The CrossTable function comes from the gmodels package, which we previously installed. Directly below are the summary statistics for testLabels and knn_test, followed by the cross table.

summary(as.numeric(testLabels))

##  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
## 1.000   1.000  1.000 1.169   1.000 3.000

summary(as.numeric(knn_test))
##  Min. 1st Qu. Median  Mean 3rd Qu.  Max.
## 1.000   1.000  1.000 1.138   1.000 3.000

## We compare the summaries from the raw data (testLabels) to the predicted data (knn_test).
## They look about equal; we can further visualize the differences with a table.

## Load "gmodels"
library(gmodels)

CrossTable(testLabels, knn_test, prop.chisq = FALSE)

##
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
##
## Total Observations in Table:  1002
##
##               | knn_test
##    testLabels |     NO SWITCH | SWITCH TO DEM | SWITCH TO REP |     Row Total |
## --------------|---------------|---------------|---------------|---------------|
##     NO SWITCH |           890 |             3 |            21 |           914 |
##               |         0.974 |         0.003 |         0.023 |         0.912 |
##               |         0.957 |         0.500 |         0.318 |               |
##               |         0.888 |         0.003 |         0.021 |               |
## --------------|---------------|---------------|---------------|---------------|
## SWITCH TO DEM |             5 |             2 |             0 |             7 |
##               |         0.714 |         0.286 |         0.000 |         0.007 |
##               |         0.005 |         0.333 |         0.000 |               |
##               |         0.005 |         0.002 |         0.000 |               |
## --------------|---------------|---------------|---------------|---------------|
## SWITCH TO REP |            35 |             1 |            45 |            81 |
##               |         0.432 |         0.012 |         0.556 |         0.081 |
##               |         0.038 |         0.167 |         0.682 |               |
##               |         0.035 |         0.001 |         0.045 |               |
## --------------|---------------|---------------|---------------|---------------|
##  Column Total |           930 |             6 |            66 |          1002 |
##               |         0.928 |         0.006 |         0.066 |               |
## --------------|---------------|---------------|---------------|---------------|
##
## We learned how to use cross tables from
## https://www.analyticsvidhya.com/blog/2015/08/learning-concept-knn-algorithms-programming/

The cross table shows the number of predictions in each category. From this table, we see that we need to count how many KNN predictions equal the corresponding labels in testLabels, and use the percentage of matching values as an indicator of accuracy. For example, if the raw test data had exactly the same "switch" labels as the KNN predictions, the KNN test would be 100% accurate. We estimate the accuracy of the knn function for a specific k value by summing the number of matching labels between knn_test and the actual test labels (testLabels). To easily obtain the accuracy for an input k, we write a function named accuracy_percentage that performs the following steps:
Set the seed to 100.
Run the KNN test and store the result in a vector named "result".
Using that result, compute the accuracy and store it in a value named "accuracy".
Return the accuracy value.

(sum(knn_test == testLabels) / 1000) * 100

## [1] 93.7

## This shows us the percentage of accuracy (the denominator of 1000 is a round approximation of
## the 1002 test counties). The remaining percentage represents the share of knn_test values that
## do not match testLabels.

accuracy_percentage <- function(x) {
  set.seed(100)
  ## Setting the seed makes the data for this specific function reproducible.
  result = knn(trainSet, testSet, trainLabels, k = x, prob = TRUE)
  accuracy = ((sum(result == testLabels) / 1000) * 100)
  return(accuracy)
}
## The accuracy_percentage function gives the percentage of matching values for any given k value.

Now that we have a function that returns an accuracy percentage for each k, we can plot different k values and find the most accurate one. The method is straightforward: we make a line plot of the accuracy value for each k from 1 to 30. From the graph, the most accurate k value is 8.
k_value <- c(1:30)
accuracy_values <- sapply(c(1:30), accuracy_percentage)
plot_values <- data.frame(k_value, accuracy_values)
## We make a data frame of the values we want to plot.

library(ggplot2)
knn_plot = ggplot(data = plot_values, aes(x = k_value, y = accuracy_values)) +
  geom_point(stat = 'identity') +
  geom_line(stat = 'identity') +
  scale_x_continuous(name = "k value") +
  scale_y_continuous(name = "Accuracy Percentage") +
  labs(title = "Accuracy Levels of k values")

knn_plot
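As a hypothetical follow-up not in the original report, the best k can also be extracted programmatically rather than read visually from the plot:

## k value with the highest accuracy percentage
k_value[which.max(accuracy_values)]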
The plot of accuracy percentages shows that the most accurate level is at k = 8: the raw test labels agree most often with the KNN predictions when each test county is compared to its 8 closest neighbors in the training data. We therefore run the KNN test with k = 8. Based on these results, it is most accurate to expect that around 953 counties will maintain their 2012 party majority, while around 0
counties will switch to a Democratic majority and 49 will switch to a Republican majority. The accuracy percentage for k = 8 is a little over 94%. Ultimately, the model predicts that the 2016 Republican candidate has a higher chance of winning the presidency than the Democrat.

accuracy_percentage(8)

## [1] 94.2

## Accuracy is a little over 94%.

set.seed(100)
knn_test <- knn(trainSet, testSet, trainLabels, k = 8, prob = TRUE)
summary(knn_test)

##     NO SWITCH SWITCH TO DEM SWITCH TO REP
##           953             0            49

CrossTable(testLabels, knn_test, prop.chisq = FALSE)

##
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
##
## Total Observations in Table:  1002
##
##               | knn_test
##    testLabels |     NO SWITCH | SWITCH TO REP |     Row Total |
## --------------|---------------|---------------|---------------|
##     NO SWITCH |           904 |            10 |           914 |
##               |         0.989 |         0.011 |         0.912 |
##               |         0.949 |         0.204 |               |
##               |         0.902 |         0.010 |               |
## --------------|---------------|---------------|---------------|
## SWITCH TO DEM |             6 |             1 |             7 |
##               |         0.857 |         0.143 |         0.007 |
##               |         0.006 |         0.020 |               |
##               |         0.006 |         0.001 |               |
## --------------|---------------|---------------|---------------|
## SWITCH TO REP |            43 |            38 |            81 |
##               |         0.531 |         0.469 |         0.081 |
##               |         0.045 |         0.776 |               |
##               |         0.043 |         0.038 |               |
## --------------|---------------|---------------|---------------|
##  Column Total |           953 |            49 |          1002 |
##               |         0.951 |         0.049 |               |
## --------------|---------------|---------------|---------------|
##
## We predict that 953 counties of the test data will not switch, 0 will switch to a Democratic
## majority, and 49 will switch to a Republican majority.
## The data suggest that Trump has a higher chance of winning the election.

Results and Discussion

Credit: Kean Amidi-Abraham and Kevin Marroquin

The 2016 election was a surprise to many in the data science community. The GOP victory contradicted the most popular and well-regarded predictors, backed by a variety of credible publications. The purpose of this project was to analyze the results of the last four presidential elections, use prediction methods to determine the outcome of the current election, and identify the factors that might lead to that outcome.

The first plot represents the voter margin in each county of the continental United States. The plot is color coded by Democratic or Republican victory, and the circles, located at the center of each county, are scaled to the voter margin (the difference in votes between the competing parties). The greatest voter margins are located in the populous urban centers throughout the country, which overwhelmingly voted Democratic. The Republican victories mostly have small margins and occur in less populous areas of the country.

The second plot shows the total vote count for each county in the continental United States, color coded by winning party and scaled to the number of votes. The white, seemingly blurry circles represent counties that are split or only slightly leaning toward one candidate. This plot reveals close voter margins much more visibly than the first plot, which represents close margins only with smaller circles. Florida, Michigan, Wisconsin, and the Carolinas contain many white or slightly red spots, which makes sense because these are all swing states. Despite these narrow margins, the surrounding areas went mostly Republican.

The third plot, and arguably the most important, highlights the counties that flipped political parties from the 2012 to the 2016 election. The darker circles represent these counties, while the more transparent circles represent counties that voted for the same party both times. Based on the vote margin scale, the counties that flipped have relatively small margins, all less than 100,000 votes. Glancing at the map, one notices that the majority of flipped counties are red, with only a few blue ones scattered
throughout the West. Most of the counties that flipped Republican are located in Iowa, Wisconsin, New York, Maine, and other states throughout the Midwest and Northeast. These counties are ultimately what produced a Republican victory in 2016.

The 2016 predictor was built using a classification tree with 11-fold cross-validation. The complexity parameter for the classification tree was varied from 0.001 to 0.1, and the value that maximized the cross-validated classification rate, approximately 0.01, was chosen. The plot of the classification tree shows the variables deemed important by the classification process, including factors such as latitude, the percentage of women over 16 in the workforce, 2012 votes, and others. Looking at a randomly chosen county that voted Republican:

# Republican county is picked using a random number generator
rand_idx = sample(dim(gopWon16)[1], 1)
gopWon16[rand_idx, c("workAtHome", "famSingleMom", "latitude", "popOver16")]

##      workAtHome famSingleMom  latitude popOver16
## 2686         72         11.4 -82299835        20

The splitting variables used by the tree appear reasonable for a county like this. We can also measure the accuracy of the classification tree along a particular path.

# Splitting variables used in the fitted tree
finalTree$frame$var

##  [1] workAtHome          famSingleMom        latitude
##  [4] hhMarried           <leaf>              dem12
##  [7] <leaf>              <leaf>              popOver16
## [10] <leaf>              <leaf>              popMarriedSeparated
## [13] dem04               <leaf>              <leaf>
## [16] <leaf>              workAtHome          hhFamily
## [19] <leaf>              <leaf>              <leaf>
## 10 Levels: <leaf> dem04 dem12 famSingleMom hhFamily hhMarried ... workAtHome
# Following the "yes" (left) branches of the tree
branch1 = elecData[elecData$workAtHome < 1104, ]
branch2 = branch1[branch1$famSingleMom < 19, ]
branch3 = branch2[branch2$latitude < -74e+6, ]
branch4 = branch3[branch3$hhMarried >= 41, ]

# Percentage of GOP wins among counties reaching this node
length(which(branch4$Win == "gop")) / dim(branch4)[1]

## [1] 0.9576304

We find the tree to be 95.8% correct for counties that follow this path. The nearest neighbor predictor is a little over 94% accurate with a k value of 8, which is comparable to the classification tree.

# Predicting Florida
florida = elecData[elecData$state == "FL", ]
branch1 = florida[florida$workAtHome < 1104, ]
branch2 = branch1[branch1$famSingleMom < 19, ]
branch3 = branch2[branch2$latitude < -74e+6, ]
branch4 = branch3[branch3$hhMarried >= 41, ]

# Percentage of GOP wins among Florida counties reaching this node
length(which(branch4$Win == "gop")) / dim(branch4)[1]

## [1] 1

References and Acknowledgements
Code packages used:

R Core Team (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/

H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2009. (http://docs.ggplot2.org/current/)

D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161. URL http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf

Original S code by Richard A. Becker, Allan R. Wilks. R version by Ray Brownrigg. Enhancements by Thomas P Minka and Alex Deckmyn. (2016). maps: Draw Geographical Maps. R package version 3.1.1. https://CRAN.R-project.org/package=maps

Gregory R. Warnes, Ben Bolker, Thomas Lumley, Randall C Johnson. Contributions from Randall C. Johnson are Copyright SAIC-Frederick, Inc. Funded by the Intramural Research Program of the NIH, National Cancer Institute and Center for Cancer Research under NCI Contract NO1-CO-12400. (2015). gmodels: Various R Programming Tools for Model Fitting. R package version 2.16.2. https://CRAN.R-project.org/package=gmodels

Terry Therneau, Beth Atkinson and Brian Ripley (2015). rpart: Recursive Partitioning and Regression Trees. R package version 4.1-10. https://CRAN.R-project.org/package=rpart

Data Sources:

Data hosting courtesy of Professor Deborah Nolan, University of California, Berkeley, Department of Statistics

2016 results originally available from https://github.com/tonmcg/County_Level_Election_Results_12-16/blob/master/2016_US_County_Level_Presidential_Results.csv

2012 results originally available from Politico: http://www.politico.com/2012-election/map/#/President/2012/

2008 results originally available from The Guardian: https://www.theguardian.com/news/datablog/2009/mar/02/us-elections-2008
2004 results available from Deborah Nolan (UC Berkeley): http://www.stat.berkeley.edu/users/nolan/data/voteProject/countyVotes2004.txt

2010 Census data available from http://factfinder2.census.gov/faces/nav/jsf/pages/searchresults.xhtml?refresh=t

Geographic information available from Deborah Nolan: http://www.stat.berkeley.edu/users/nolan/data/voteProject/counties.gml

Data Description:

http://stats.stackexchange.com/questions/121886/when-should-i-apply-feature-scaling-for-my-data
http://stats.stackexchange.com/questions/10149/converting-a-vector-to-a-vector-of-standard-units-in-r

Map Visualization:

http://stackoverflow.com/questions/6528180/ggplot2-plot-without-axes-legends-etc
http://docs.ggplot2.org/current/

Predicting 2016 Results:

Recursive partition tree code based on code provided by Professor Deborah Nolan, UC Berkeley Dept. of Statistics

Predicting Changes From 2012 to 2016:

We discovered the simplify2array function from http://grokbase.com/p/r/r-help/12c4sgejgj/r-list-to-matrix
We discovered the complete.cases function from http://stackoverflow.com/questions/4862178/remove-rows-with-nas-missing-values-in-data-frame
We learned how to use cross tables from https://www.analyticsvidhya.com/blog/2015/08/learning-concept-knn-algorithms-programming/