2. Background - Election data in the US
• Election results are not reported by any single federal agency
• Instead, each state & county reports in a variety of formats --
HTML, PDF, CSV, often with very different layouts and varying
levels of granularity
• There are many elections besides the Presidential: primaries for
each party, plus mid-term and special elections for various offices
(US Senate, US House, State legislatures, Governor, etc.)
• There is no freely available, comprehensive source of official
election results that people can use for analysis or journalists
for reporting
• Article: “Elections: The final frontier of open data?”
• https://sunlightfoundation.com/2015/02/27/elections-the-final-frontier-of-open-data/
3. About OpenElections
• The goal of this Open Data effort is “to create the first free,
comprehensive, standardized, linked set of election data for
the United States, including federal, statewide and state
legislative offices”
• Website openelections.net (not current, need volunteers)
• docs.openelections.net (instructions to contribute)
• Github Page (updated regularly)
• Contains latest work in progress
• Separate repo for each state
• Processed data by year, election
• Instructions for contributors
• Contributed code/scripts mainly in Python
• Issue tracking
@openelex
4. Motivation
• I have been volunteering with OpenElections towards
creating such a source
• I use R to automate some of these tasks - web-scraping, PDF
conversion and for data manipulation to produce the desired
outputs in a consistent format
• In this lightning talk, using real examples from multiple US
states, I will highlight some of the challenges I faced
• I will also share some of the R packages I used (RSelenium,
XML, pdftools, tabulizer, dplyr, tidyr, data.table), with the aim of
helping others who wish to volunteer with similar Open Data efforts
5. Desired output format (csv)
1. County
2. Precinct (if available)
3. Office (President, U.S. Senate, U.S. House, State
Senate, State House, Attorney General, etc)
4. District (# for U.S. House district or State
Senate or State House district)
5. Party (DEM, REP, LIB, OTH…)
6. Candidate (names of candidates)
7. Votes (# of votes received)
OpenElections specifies a standardized format for the desired output
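As a minimal sketch of that layout, the data frame below has one row per candidate per precinct, using the seven fields listed above. The lowercase column names, the placeholder rows and the temporary file are assumptions made for illustration; they are not taken from the OpenElections specification.

```r
# Build a tiny results table with the seven fields from the list above.
# All values are placeholders for illustration, not real election results.
results <- data.frame(
  county    = c("Example County", "Example County"),
  precinct  = c("Precinct 1", "Precinct 1"),
  office    = c("U.S. House", "U.S. House"),
  district  = c(3, 3),
  party     = c("DEM", "REP"),
  candidate = c("Candidate A", "Candidate B"),
  votes     = c(120, 95),
  stringsAsFactors = FALSE
)

# Write the table as CSV; a temporary file stands in for the real path.
out_file <- tempfile(fileext = ".csv")
write.csv(results, out_file, row.names = FALSE)

# Read it back to confirm the columns round-trip in the expected order.
check <- read.csv(out_file, stringsAsFactors = FALSE)
print(names(check))
```

Keeping one long table with a fixed column order means files from different states can be stacked without further reshaping.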
6. Let’s take a look at 4 US States
IOWA, ALASKA, TENNESSEE, MISSOURI
8. Iowa 2016 General Election all Races at Precinct-level (txt file)
Let’s start with an easy case
Sample input data is available as a text file in wide format, so only data manipulation is needed
(shown on the slide in Excel for ease of reading)
[Slide image: the wide file has 5000+ columns, one per county + precinct + vote type; the race title holds office + district]
length(unique(long_DF$RaceTitle))
[1] 197
9. Convert the wide file to a long file using the tidyr package (gather command) so each countyprecinct is in a separate row
long_DF <- df %>% gather(countyprecinct, Votes, c(4:5119))
Sample relevant commands (actual code is more elaborate)
Challenges along the way
• Countyprecinct in input file was
separated by “-” but precinct
names also contained “-”
• Absentee, Polling & Total votes
needed to be retained
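Both challenges can be handled in one step after the reshape. The sketch below assumes column names of the form county-precinct-votetype, where only the precinct part may contain extra "-" characters; the toy column names and values are invented for illustration and the real Iowa file may differ.

```r
library(dplyr)
library(tidyr)

# Toy wide table: 3 id columns, then one column per county-precinct-votetype.
# The column naming pattern is an assumption made for this sketch.
df <- data.frame(
  RaceTitle     = c("President", "President"),
  CandidateName = c("Candidate A", "Candidate B"),
  Party         = c("DEM", "REP"),
  `Adair-2 NE-SW-Polling`  = c(10, 20),
  `Adair-2 NE-SW-Absentee` = c(3, 4),
  `Adair-2 NE-SW-Total`    = c(13, 24),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

long_DF <- df %>%
  # Wide to long: one row per candidate per county-precinct-votetype column.
  gather(countyprecinct, Votes, 4:6) %>%
  # County = text before the first "-", VoteType = text after the last "-";
  # the greedy middle group keeps any "-" inside the precinct name.
  tidyr::extract(countyprecinct,
                 into  = c("County", "Precinct", "VoteType"),
                 regex = "^([^-]+)-(.*)-([^-]+)$")

print(long_DF)
```

Anchoring on the first and last "-" rather than splitting on every "-" is what keeps a precinct such as "2 NE-SW" in one piece, and the VoteType column retains the Absentee, Polling and Total rows.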
12. Another sample file – Alaska 2016 General Election at Precinct-level (data manipulation only)
Like the IA file shown earlier, this is also a csv file, but the
layout is different and new custom code is needed to process it
13. Even the same state changes the format and layout of results from one election to the next
Alaska 2012 General Election results in csv are only at District-level
(and different layout/columns from 2016)
To get precinct-level results, need to process
40 PDFs – one for each county (district)
14. Used
• pdftools package - pdf_text
• tabulizer package - extract_tables, extract_text
Abandoned after trying out a variety of ways to get a consistent pattern across
multiple pages and files so that I could extract the data via a script
16. Tennessee 2004 General Elections
[Slide image: PDF page with annotations for office, candidate, party, county, precinct, votes, and a candidate with party=OTH]
The results of 1 single election are available in 4 distinct
PDFs, each with dozens of pages
17. TENNESSEE
Multiple races in a single PDF
Varying number of candidates per race
Determining where a new race has
started is not straightforward
[Slide image: PDF page with annotations for district and candidate]
TN election results website:
http://sos.tn.gov/products/elections/election-results
18. Pseudo code for TN PDF
• Download file, read
• Convert PDF to free-form text
• Find separators for race, page, county
• Determine number of races, pages, counties per race
• Determine number of candidates per race
• Determine number of rows and columns taken up by
candidate names
• Find number of precincts by race
• Tokenize and Compute number of words in each
precinct name
• Create list of candidates by district
• Merge main data frame with candidates df
• Remove unwanted rows
• Transform and standardize into desired format
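The first few steps (convert to text, find race separators, count races and pages) can be sketched as follows. The character vector stands in for the output of pdf_text, and the "RACE:" marker, page contents and column names are invented for illustration; the real TN PDFs need their own separator patterns.

```r
library(dplyr)

# Stand-in for pdftools::pdf_text(filename): one string per PDF page.
# The page contents and the race-title pattern are invented for this sketch.
txt <- c(
  "RACE: Governor\n1 2\nCOUNTY PRECINCT\nExample County Precinct A 10 20\n",
  "RACE: State Senate District 5\n1 2 3\nCOUNTY PRECINCT\nExample County Precinct A 5 6 7\n"
)

# One row per line of text, remembering which page each line came from.
pages <- strsplit(txt, "\n", fixed = TRUE)
df <- data.frame(
  V1   = unlist(pages),
  page = rep(seq_along(pages), lengths(pages)),
  stringsAsFactors = FALSE
)

# A line matching the (hypothetical) race marker starts a new race;
# cumsum() turns those markers into a running race id.
df <- df %>%
  mutate(new_race = grepl("^RACE:", V1),
         Race     = cumsum(new_race))

# Number of lines and pages taken up by each race.
race_summary <- df %>%
  group_by(Race) %>%
  summarise(title = first(V1), n_lines = n(), n_pages = n_distinct(page))

print(race_summary)
```

Once every line carries a race id, the later steps (candidates per race, precincts per race) become grouped operations on the same data frame.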
library(pdftools)
library(dplyr)

txt <- pdf_text(filename)
#' Store the whole pdf in one dataframe of 1 column (one row per line of text)
df <- read.csv(textConnection(txt), sep = "\n", header = FALSE,
               stringsAsFactors = FALSE)
## Find out how many candidates per Race & how many rows for candidate names
## (assumes a Race column has already been added to df and that
## CANDIDATE_BLK_EXTRA_LINES is a constant defined elsewhere in the script)
## logic for num_cand is based on number of columns for vote counts:
## for example, take the row before "COUNTY", which reads 1 2 3 4 ..., and take the max
## logic for numrows_col1 is based on count of rows between race name
## & vote count column headers
a <- df %>%
  group_by(Race) %>%
  mutate(key = grep("COUNTY", V1)[1] - 1,  # row prior to first match
         # split on whitespace and convert to numbers before taking the max,
         # so that races with 10 or more candidates are handled correctly
         num_cand = max(as.numeric(unlist(strsplit(trimws(V1[key]), "\\s+"))),
                        na.rm = TRUE),
         numrows_col1 = key - CANDIDATE_BLK_EXTRA_LINES,
         diff = (num_cand == numrows_col1)  # catch where num of candidates
         # is diff from extra rows between race & vote headers
  ) %>%
  select(-key)
20. TENNESSEE
7 candidates, listed in 2 columns, 5 rows
Candidate names in 4 rows, 3 columns
Party handled differently.
There is yet another example (not shown) with more than 10 candidates, where
a single row (precinct) runs across multiple pages!
21. Wrote a bunch of helper functions
[Slide image: sample helper functions, with their input parameters annotated]
22. Multiple lines for a candidate
One of the many interesting challenges along the way
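# create new df with names of candidates by district
c2 <- candidate_list   # keep a copy of the previous list (assumes one already exists)
candidate_list <- b %>%
  group_by(district) %>%
  slice(2:(numrows_col1 + 1)) %>%   # rows holding the candidate names for each district
  select(V1, district, num_cand, numrows_col1)
# helper function defined elsewhere in the script
clean_cand <- create_list_candidates_and_numbers(candidate_list)
candidate_list1 <- clean_cand %>%
  separate(Candidate, c("Candidate", "party"),
           sep = " . ") %>%   # sep is a regex: any single character between two spaces
  unite(dist_cand, district, Number,
        sep = "_Z_", remove = TRUE)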
[Slide image: in the input PDF, a candidate name that wraps onto two lines appears as 2 candidates in the data frame]
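One way to handle a candidate name that wraps onto a second line is to treat any line that does not start with a candidate number as a continuation of the previous one. The sketch below assumes numbered lines of the form "1. Name - PARTY"; that layout and the sample names are assumptions for illustration, and this is not the helper function used in the talk.

```r
# Candidate block as it might come out of the PDF text: candidate 2 wraps
# onto a second line, so a naive line count would see 4 entries, not 3.
cand_lines <- c(
  "1. Candidate A - DEM",
  "2. Candidate With A Very Long",
  "Name - REP",
  "3. Candidate C - OTH"
)

# Lines starting with "<number>." begin a new candidate; others continue one.
starts   <- grepl("^[0-9]+\\.", cand_lines)
group_id <- cumsum(starts)

# Glue each candidate's pieces back together with a single space.
merged <- unname(vapply(split(cand_lines, group_id),
                        paste, character(1), collapse = " "))

# Split "<number>. <name> - <party>" into its three parts.
parts <- regmatches(merged, regexec("^([0-9]+)\\. +(.*) +- +([A-Z]+)$", merged))
candidates <- data.frame(
  Number    = as.integer(sapply(parts, `[`, 2)),
  Candidate = sapply(parts, `[`, 3),
  party     = sapply(parts, `[`, 4),
  stringsAsFactors = FALSE
)
print(candidates)
```

Comparing nrow(candidates) with the candidate count read from the vote-count header is a cheap check that no wrapped name was counted twice.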
27. MISSOURI - Convert and transform raw table data into the desired format
[Slide image: results table with annotations for office, district, candidate, party, votes]
After extracting and manipulating 100+ HTML pages, county-level
(not precinct-level) data from 1 election is ready!
28. MISSOURI - Use the RSelenium package to simulate a headless browser
• Initialize browser session
• Go to URL
• Select Election name from Dropdown
• Click Choose Election button
• Select County name from Dropdown
• Click Submit button
• Get HTML Data for selected Election and County
• Process HTML and Extract Table
• Convert to Raw Data (readHTMLTable())
• Transform raw data into desired format
• Repeat for all counties for that Election
library(RSelenium)
library(XML)
library(dplyr)  # for %>%

remDrv <- remoteDriver(browserName = 'phantomjs') # instantiate new remoteDriver
remDrv$open() # open method in remoteDriver class
url <- 'http://enrarchives.sos.mo.gov/enrnet/CountyResults.aspx'
# Simulate browser session and fill out form
remDrv$navigate(url) # send headless browser to url
# Select the Election from DROPDOWN using id in xpath
# (selected_election is the position of the desired option; assumed to be set beforehand)
elec_xp <- paste0('//*[@id="cboElectionNames"]/option[',
                  selected_election, ']')
remDrv$findElement(using = "xpath", elec_xp)$clickElement()
# election is set
# ---- Click the button to select the Election
eBTN <- '//*[@id="MainContent_btnElectionType"]'
remDrv$findElement(using = 'xpath', eBTN)$clickElement()
## Get the HTML data from the page and process it using XML package
raw <- remDrv$getPageSource()[[1]]
counties_val <- xpathSApply(htmlParse(raw),
                            '//*[@id="cboCounty"]/option', xmlAttrs)
chosen_county <- grep("selected", counties_val)
# Extract the Table (Election results)
resTable <- raw %>% readHTMLTable()
resDf <- resTable[[1]] # return desired data frame from list of tables
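The last steps (extract the table, convert it to raw data, transform it into the desired format) can be tried without a browser session. The sketch below parses a small hand-written HTML string in place of the page source; the table layout, county, office and candidate values are all hypothetical, and the real results page may be structured differently.

```r
library(XML)

# Hypothetical results page for one county; the real page layout may differ.
html <- '<html><body><table id="results">
  <tr><th>Candidate</th><th>Party</th><th>Votes</th></tr>
  <tr><td>Candidate A</td><td>DEM</td><td>1,234</td></tr>
  <tr><td>Candidate B</td><td>REP</td><td>2,345</td></tr>
</table></body></html>'

# Parse the page source and pull out every <table> as a data frame.
doc    <- htmlParse(html, asText = TRUE)
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
raw_df <- tables[[1]]

# Transform the raw table into the standardized columns.
out <- data.frame(
  county    = "Example County",  # would come from the selected dropdown value
  office    = "Governor",        # hypothetical office for this table
  district  = NA,
  party     = as.character(raw_df$Party),
  candidate = as.character(raw_df$Candidate),
  # strip thousands separators before converting to numbers
  votes     = as.numeric(gsub(",", "", as.character(raw_df$Votes))),
  stringsAsFactors = FALSE
)
print(out)
```

Wrapping the transform in a function that takes the page source and the county name would let the same code run once per county, with the per-county results stacked by rbind or dplyr::bind_rows.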
29. Conclusions & Takeaways
• Great way to learn and contribute
• pdftools – Good package for extracting text data from PDFs
• tabulizer – Useful package for extracting tabular data from PDFs
• RSelenium, XML – Great packages for web-scraping pages that require filling out (simulated) forms
• Lots of work still needs to be done for recent elections (2000-2016) across
all states
• 50 states, 100s of input files in a variety of formats per state
• Meaningful analysis can be done by data scientists once data is available
• Presidential election results get a lot of attention, but other races are
arguably as important