Contributing to OpenElections
(Open Data) Using R
Rupal Agrawal
BARUG Meetup
February 2017
Background - Election data in the US
• Election results are not reported by any single federal agency
• Instead, each state & county reports in a variety of formats --
HTML, PDF, CSV, often with very different layouts and varying
levels of granularity
• There are many elections besides the Presidential: primaries for
each party, mid-term and special elections for various offices (US Senate,
US House, State legislatures, Governor, etc.)
• There is no freely available, comprehensive source of official
election results for people to use for analysis or for journalists to use in
reporting
• Article: “Elections: The final frontier of open data?”
• https://sunlightfoundation.com/2015/02/27/elections-the-final-frontier-of-open-data/
About OpenElections
• Goal of this Open Data effort “to create the first free,
comprehensive, standardized, linked set of election data for
the United States, including federal, statewide and state
legislative offices”
• Website: openelections.net (not current, need volunteers)
• docs.openelections.net (instructions to contribute)
• GitHub page (updated regularly)
  • Contains latest work in progress
  • Separate repo for each state
  • Processed data by year, election
  • Instructions for contributors
  • Contributed code/scripts mainly in Python
  • Issue tracking
@openelex
Motivation
• I have been volunteering with OpenElections towards
creating such a source
• I use R to automate some of these tasks: web-scraping, PDF
conversion and data manipulation, to produce the desired
outputs in a consistent format
• In this lightning talk, using real examples from multiple US
states, I will highlight some of the challenges I faced
• I will also share some of the R packages I used (RSelenium,
XML, pdftools, tabulizer, dplyr, tidyr, data.table) to help
others wishing to volunteer with similar Open Data efforts
Desired output format (csv)
1. County
2. Precinct (if available)
3. Office (President, U.S. Senate, U.S. House, State
Senate, State House, Attorney General, etc)
4. District (# for U.S. House district or State
Senate or State House district)
5. Party (DEM, REP, LIB, OTH…)
6. Candidate (names of candidates)
7. Votes (# of votes received)
OpenElections specifies a standardized format for the desired output
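As a rough illustration of the seven-column layout listed above, here is a minimal base R sketch that builds a tiny result set and writes it to csv. The lower-case column names, the made-up rows, and the temporary file are assumptions for illustration only; they are not taken from the OpenElections documentation.

```r
# Made-up rows in the seven-column layout (all values are hypothetical)
results <- data.frame(
  county    = c("Example County", "Example County"),
  precinct  = c("Precinct 1", "Precinct 1"),
  office    = c("U.S. House", "U.S. House"),
  district  = c("3", "3"),
  party     = c("DEM", "REP"),
  candidate = c("Candidate A", "Candidate B"),
  votes     = c(120L, 95L),
  stringsAsFactors = FALSE
)

# Write to a temporary csv file without row names
out_file <- tempfile(fileext = ".csv")
write.csv(results, out_file, row.names = FALSE)

# Read it back to confirm the layout survived the round trip
check <- read.csv(out_file, stringsAsFactors = FALSE)
print(names(check))
```

Keeping one row per candidate per precinct is what makes files from different states stackable later.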
Let’s take a look at 4 US States
IOWA ALASKA
TENNESSEE
MISSOURI
IOWA
Iowa 2016 General Election all Races at Precinct-level (txt file)
Let’s start with an easy case
Sample input data is available as a text file in wide format; data manipulation only
(shown below in Excel for ease of reading)
> length(unique(long_DF$RaceTitle))
[1] 197
[Screenshot annotations: 5000+ columns, one per county + precinct + vote type; each row is an office + district]
Convert the wide file to a long file using the tidyr package (gather command) so that each countyprecinct is in a separate row
long_DF <- df %>% gather(countyprecinct, Votes, c(4:5119))
Sample relevant commands (actual code is more elaborate)
Challenges along the way
• Countyprecinct in the input file was separated by “-”, but precinct names also contained “-”
• Absentee, Polling & Total votes needed to be retained
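The wide-to-long step and the hyphen problem can be tried on a miniature table. This is only a sketch under stated assumptions: the column names and values are invented, and replacing the first "-" with a marker is one possible way to arrive at a "ZZZZ" separator, not necessarily what the actual code does. It also assumes the county name itself contains no hyphen.

```r
library(tidyr)
library(dplyr)

# Hypothetical miniature of the wide file: one row per race,
# one column per county-precinct (names are made up)
wide <- data.frame(
  RaceTitle = c("President", "State Senator Dist. 4"),
  `Adair-Precinct 1` = c(100, 40),
  `Adair-North-East` = c(80, 35),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Wide to long: every county-precinct column becomes its own row
long <- wide %>% gather(countyprecinct, Votes, -RaceTitle)

# sub() replaces only the first "-", so hyphens inside precinct names survive
long$countyprecinct <- sub("-", "ZZZZ", long$countyprecinct, fixed = TRUE)
long <- long %>% separate(countyprecinct, c("county", "precinct"), sep = "ZZZZ")

print(long)
```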
Split combined columns like RaceTitle and countyprecinct into individual columns
separate_DF <- long_DF %>% separate(RaceTitle, c("office", "district"), sep = " Dist. ")
separate_DF <- separate_DF %>% separate(countyprecinct, c("county", "precinct"), sep = "ZZZZ")
# put the absentee and polling vote columns next to the outputT columns
cbind(outputT, outputAbs$absentee_votes, outputP$polling_votes)
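A small sketch of the same splitting idea on invented data follows. Two things here are assumptions rather than the author's method: `fill = "right"` is used so that races without a district (such as President) get `NA` instead of a warning, and `spread()` is swapped in as an alternative to `cbind()` for keeping Absentee, Polling and Total votes side by side. The `votetype` column is hypothetical.

```r
library(tidyr)
library(dplyr)

# Hypothetical long data: one row per race, precinct and vote type
long <- data.frame(
  RaceTitle = rep(c("President", "State Senator Dist. 4"), each = 3),
  county    = "Adair",
  precinct  = "Precinct 1",
  votetype  = rep(c("Absentee", "Polling", "Total"), 2),
  Votes     = c(30, 70, 100, 10, 30, 40),
  stringsAsFactors = FALSE
)

tidy <- long %>%
  # "\\." matches a literal dot; races without " Dist. " get district = NA
  separate(RaceTitle, c("office", "district"), sep = " Dist\\. ", fill = "right") %>%
  # one column per vote type
  spread(votetype, Votes)

print(tidy)
```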
ALASKA
Another sample file – Alaska 2016 General Election at Precinct-level (data manipulation only)
Like the IA file shown earlier, this is also a csv file, but the
layout is different and new custom code is needed to process it
Even the same state changes the format and layout of its results from one election to the next
Alaska 2012 General Election results in csv are only at District-level
(and with a different layout/columns from 2016)
To get precinct-level results, one needs to process
40 PDFs – one for each county (district)
Used
• pdftools package - pdf_text
• tabulizer package - extract_tables, extract_text
Abandoned after trying a variety of ways to find a consistent pattern across
multiple pages and files that would let me extract the data via a script
TENNESSEE
Tennessee 2004 General Elections
[Screenshot of a source PDF page, annotated: office, county, precinct, candidate, party, votes; candidate with party=OTH]
Results for 1 single election are available in 4 distinct
PDFs, each with dozens of pages
[Screenshot of a source PDF page, annotated: district, candidate]
Multiple races in a single PDF
Varying number of candidates per race
Determining where a new race has
started is not straightforward
TN election results website:
http://sos.tn.gov/products/elections/election-results
Pseudo code for TN PDF
• Download file, read
• Convert PDF to free-form text
• Find separators for race, page, county
• Determine number of races, pages, counties per race
• Determine number of candidates per race
• Determine number of rows and columns taken up by candidate names
• Find number of precincts by race
• Tokenize and compute number of words in each precinct name
• Create list of candidates by district
• Merge main data frame with candidates df
• Remove unwanted rows
• Transform and standardize into desired format
library(pdftools)  # pdf_text()
library(dplyr)

txt <- pdf_text(filename)  # one character string per PDF page
# Store the whole pdf in one dataframe of 1 column (V1), one row per text line
df <- read.csv(textConnection(txt), sep = "\n", header = FALSE,
               stringsAsFactors = FALSE)
## Find out how many candidates per Race & how many rows for candidate names
## logic for num_cand is based on number of columns for vote counts
## example, searching for row before "COUNTY" and see 1 2 3 4...and take max
## logic for numrows_col1 is based on count of rows between race name
## & vote count column headers
## (the Race column and CANDIDATE_BLK_EXTRA_LINES come from steps not shown here)
a <- df %>%
  group_by(Race) %>%
  mutate(key = grep("COUNTY", V1)[1] - 1,  # row prior to first match
         # split = "" breaks the row into single characters,
         # so column numbers above 9 are not handled
         num_cand = as.numeric(max(unlist(strsplit(V1[key], split = "")),
                                   na.rm = TRUE)),
         numrows_col1 = key - CANDIDATE_BLK_EXTRA_LINES,
         # catch where num of candidates is diff from
         # extra rows between race & vote headers
         diff = (num_cand == numrows_col1)
  ) %>%
  select(-key)
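The counting logic described in the comments above can be tried on a toy block of text in base R. Everything in this sketch is an assumption for illustration: the text lines only loosely imitate the PDF layout, the value 2 stands in for `CANDIDATE_BLK_EXTRA_LINES`, and digits are pulled out with a regular expression (a swapped-in variant that also copes with column numbers above 9).

```r
# Hypothetical text lines for one race (not real Tennessee data)
race_lines <- c(
  "United States Senate",              # race name
  "1. Jane Doe - DEM",                 # candidate name rows
  "2. John Roe - REP",
  "3. Pat Poe - OTH",
  "                      1    2    3", # column numbers for the vote counts
  "COUNTY    PRECINCT",
  "ANDERSON  Andersonville  120   95    4"
)

# Row just before the first "COUNTY" row holds the column numbers
key <- grep("COUNTY", race_lines)[1] - 1

# Extract every run of digits from that row and take the largest
col_numbers <- as.integer(
  regmatches(race_lines[key], gregexpr("[0-9]+", race_lines[key]))[[1]]
)
num_cand <- max(col_numbers)

# Assumed: 2 non-candidate rows (race name row and column-number row)
extra_lines <- 2
numrows_col1 <- key - extra_lines

# TRUE when each candidate occupies exactly one name row
one_row_per_candidate <- (num_cand == numrows_col1)
print(c(key = key, num_cand = num_cand, numrows_col1 = numrows_col1))
```

When the flag is FALSE, the candidate names are probably laid out in several columns or wrapped over several lines, which are the cases the following slides describe.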
[Screenshots: initial DF and intermediate DF]
[Screenshot] 7 candidates, listed in 2 columns, 5 rows
[Screenshot] Candidate names in 4 rows, 3 columns
Party is handled differently.
There is yet another example (not shown) with >10 candidates,
where a single row (precinct) runs across multiple pages!
Wrote a bunch of helper functions like these below
[Screenshot of helper functions, annotated: input parameters]
One of the many interesting challenges along the way: multiple lines for a candidate
[Screenshots: input PDF and resulting DF] A candidate whose entry spans multiple lines appears as 2 candidates!
Sample code
# create new df with names of candidates by district
c2 <- candidate_list
candidate_list <- b %>%
  group_by(district) %>%
  slice(2:(numrows_col1 + 1)) %>%
  select(V1, district, num_cand, numrows_col1)
clean_cand <- create_list_candidates_and_numbers(candidate_list)
candidate_list1 <- clean_cand %>%
  separate(Candidate, c("Candidate", "party"),
           sep = " . ") %>%
  unite(dist_cand, district, Number,
        sep = "_Z_", remove = TRUE)
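One way to stitch a wrapped candidate entry back together is sketched below in base R. The numbered "N. Name - PARTY" line format is hypothetical (the real PDF layout is only visible in the screenshots), and this is not the author's `create_list_candidates_and_numbers` helper. The assumption is that a line starting with a number begins a new candidate and any other line continues the previous one.

```r
# Hypothetical candidate block; the second candidate wraps onto a new line
block <- c("1. Jane Q. Doe - DEM",
           "2. John Jacob Jingleheimer",
           "   Schmidt - REP",
           "3. Pat Poe - OTH")

# A line that starts with "<number>." opens a new candidate
starts <- grepl("^\\s*[0-9]+\\.", block)
group  <- cumsum(starts)  # continuation lines inherit the previous group id

# Paste the pieces of each candidate back into a single line
merged <- as.vector(tapply(trimws(block), group, paste, collapse = " "))

# Split each merged line into number, name and party
candidates <- data.frame(
  number    = as.integer(sub("^([0-9]+)\\..*$", "\\1", merged)),
  candidate = sub("^[0-9]+\\.\\s*(.*) - [A-Z]+$", "\\1", merged),
  party     = sub("^.* - ([A-Z]+)$", "\\1", merged),
  stringsAsFactors = FALSE
)
print(candidates)
```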
MISSOURI
Missouri 2016 Primary Elections (at county level) - HTML
100+ counties in dropdown
Note: URL doesn’t change with selections
[Screenshots of the results page, annotated: county, office, district, candidate, party, votes]
Convert and transform the raw table data into the desired format
After extracting and manipulating 100+ HTML pages, county-level
(not precinct-level) data from 1 election is ready!
Use the RSelenium package to simulate a headless browser
• Initialize browser session
• Go to URL
• Select Election name from dropdown
• Click Choose Election button
• Select County name from dropdown
• Click Submit button
• Get HTML data for selected Election and County
• Process HTML and extract table
• Convert to raw data (readHTMLTable())
• Transform raw data into desired format
• Repeat for all counties for that Election
library(RSelenium)
library(XML)
library(dplyr)  # for %>%

remDrv <- remoteDriver(browserName = 'phantomjs')  # instantiate new remoteDriver
remDrv$open()  # open method in remoteDriver class
url <- 'http://enrarchives.sos.mo.gov/enrnet/CountyResults.aspx'
# Simulate browser session and fill out form
remDrv$navigate(url)  # send headless browser to url
# Select the Election from DROPDOWN using id in xpath
# (selected_election must be set beforehand; it is the option's position in the dropdown)
elec_xp <- paste0('//*[@id="cboElectionNames"]/option[', selected_election, ']')
remDrv$findElement(using = "xpath", elec_xp)$clickElement()
# election is set
# ---- Click the button to select the Election
eBTN <- '//*[@id="MainContent_btnElectionType"]'
remDrv$findElement(using = 'xpath', eBTN)$clickElement()
## Get the HTML data from the page and process it using XML package
raw <- remDrv$getPageSource()[[1]]
counties_val <- xpathSApply(htmlParse(raw), '//*[@id="cboCounty"]/option', xmlAttrs)
chosen_county <- grep("selected", counties_val)
# Extract the Table (Election results)
resTable <- raw %>% readHTMLTable()
resDf <- resTable[[1]]  # return desired data frame from list of tables
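The table-extraction and transform steps can be tried without a browser by feeding a small HTML string to the XML package. This is a sketch only: the HTML below is invented (the real page's table layout is not reproduced in the slides), the county name is a placeholder, and the columns are picked by position on the assumption that they are candidate, party and votes in that order.

```r
library(XML)

# Hypothetical page source standing in for remDrv$getPageSource()[[1]]
raw <- '<html><body><table>
<tr><th>Candidate</th><th>Party</th><th>Votes</th></tr>
<tr><td>Candidate A</td><td>DEM</td><td>1,234</td></tr>
<tr><td>Candidate B</td><td>REP</td><td>987</td></tr>
</table></body></html>'

# Parse the string as HTML and pull out the first table
doc   <- htmlParse(raw, asText = TRUE)
resDf <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)[[1]]

# Reshape toward the desired output format
out <- data.frame(
  county    = "Example County",  # placeholder; would come from the dropdown
  candidate = resDf[[1]],
  party     = resDf[[2]],
  votes     = as.integer(gsub(",", "", resDf[[3]])),  # drop thousands separators
  stringsAsFactors = FALSE
)
print(out)
```

In a full run, this block would sit inside a loop over the county dropdown options, with the per-county data frames stacked at the end.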
Conclusions & Takeaways
• Great way to learn and contribute
• pdftools – Good package for extracting text data from PDFs
• tabulizer – Useful package for extracting tabular data from PDFs
• RSelenium, XML – Great packages for web-scraping pages with forms (by simulating a browser)
• Lots of work still needs to be done for recent elections (2000-2016) across
all states
• 50 states, 100s of input files in a variety of formats per state
• Meaningful analysis can be done by data scientists once data is available
• Presidential election results get a lot of attention, but other races are
arguably as important
Questions?
Rupal Agrawal
rupal_agrawal@yahoo.com
Info on OpenElections:
@openelex
docs.openelections.net
https://github.com/openelections

2017 Contributing to Open Elections Data using R

  • 1. Contributing to OpenElections (Open Data) Using R Rupal Agrawal BARUG Meetup February 2017
  • 2. Background - Election data in the US • Election results are not reported by any single federal agency • Instead, each state & county reports in a variety of formats -- HTML, PDF, CSV, often with very different layouts and varying levels of granularity • Number of elections, besides the Presidential – primaries for each party, mid-term and special for various offices (US Senate, US House, State legislatures, Governor, etc.) • There is no freely available comprehensive source of official election results, for people to use for analysis or journalists for reporting • Article: “Elections: The final frontier of open data?” • https://sunlightfoundation.com/2015/02/27/elections-the-final-frontier-of-open-data/ 2
  • 3. About OpenElections • Goal of this Open Data effort “to create the first free, comprehensive, standardized, linked set of election data for the United States, including federal, statewide and state legislative offices” • Website openelections.net (not current, need volunteers) • docs.openelections.net (instructions to contribute) • Github Page (updated regularly) • Contains latest work in progress • Separate repo for each state • Processed data by year, election • Instructions for contributors • Contributed code/scripts mainly in Python • Issue tracking 3 @openelex
  • 4. Motivation • I have been volunteering with OpenElections towards creating such a source • I use R to automate some of these tasks - web-scraping, PDF conversion and for data manipulation to produce the desired outputs in a consistent format • In this lightning talk, using real examples from multiple US states, I will highlight some of the challenges I faced • I will also share some of the R packages I used – RSelenium, XML, pdftools, tabulizer, dplyr, tidyr, data.table aimed to help others wishing to volunteer with similar Open Data efforts 4
  • 5. Desired output format (csv) 1. County 2. Precinct (if available) 3. Office (President, U.S. Senate, U.S. House, State Senate, State House, Attorney General, etc) 4. District (# for U.S. House district or State Senate or State House district) 5. Party (DEM, REP, LIB, OTH…) 6. Candidate (names of candidates) 7. Votes (# of votes received) OpenElections specifies a standardized format for the desired output 5 Output Format
  • 6. Let’s take a look at 4 US States IOWA ALASKA TENNESSEE MISSOURI
  • 8. Iowa 2016 General Election all Races at Precinct-level (txt file) Let’s start with an easy case Sample Input Data is available as text file in Wide format - Data Manipulation only (shown below in Excel for ease of reading)  length(unique(long_DF$RaceTitle))  [1] 197 5000+ columns county+precinct+votetypeoffice+district 8 IOWA
  • 9. Convert Wide file to long file using tidyR package (gather command) so each countyprecinct is in a separate row long_DF <- df %>% gather(countyprecinct, Votes, c(4:5119)) Sample relevant commands (actual code is more elaborate) Challenges along the way • Countyprecinct in input file was separated by “-” but precinct names also contained “-” • Absentee, Polling & Total votes needed to be retained 9 IOWA
  • 10. Iowa: split combined columns like RaceTitle and countyprecinct into individual columns
          separate_DF <- long_DF %>% separate(RaceTitle, c("office", "district"), sep = " Dist. ")
          separate_DF %>% separate(countyprecinct, c("county", "precinct"), sep = "ZZZZ")
          cbind(outputT, outputAbs$absentee_votes, outputP$polling_votes)
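One possible way to cope with hyphens inside precinct names, not necessarily the approach used in the talk, is to split only at the first “-”. The sketch below assumes tidyr is installed and that county names themselves contain no hyphen; the sample values are made up.

```r
library(tidyr)

# Made-up county-precinct labels; two precinct names contain their own hyphens
cp <- data.frame(
  countyprecinct = c("Adair-1NW", "Polk-Des Moines-12", "Linn-Cedar Rapids-3-A"),
  Votes = c(120, 95, 60),
  stringsAsFactors = FALSE
)

# extra = "merge" splits at most once here, so later hyphens stay in the precinct
split_df <- separate(cp, countyprecinct, c("county", "precinct"),
                     sep = "-", extra = "merge")
```

If a county name could contain a hyphen, this shortcut would mis-split it, so a lookup against a list of known county names would be safer.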
  • 12. Another sample file: Alaska 2016 General Election at precinct level (data manipulation only). Like the IA file shown earlier, this is also a csv file, but the layout is different and new custom code is needed to process it
  • 13. Alaska: even the same state changes the format and layout of results from one election to the next. Alaska 2012 General Election results in csv are only at district level (with a different layout/columns from 2016). To get precinct-level results, need to process 40 PDFs, one for each county (district)
  • 14. Alaska PDFs: used
      • pdftools package: pdf_text
      • tabulizer package: extract_tables, extract_text
      Abandoned after trying a variety of ways to find a pattern consistent enough across multiple pages and files to extract the data via a script
  • 16. Tennessee 2004 General Election: results for 1 single election are available in 4 distinct PDFs, each with dozens of pages
      (Annotated PDF page on the slide; labelled fields: votes, office, candidate, party, county, precinct, and a candidate with party=OTH)
  • 17. Tennessee
      • Multiple races in a single PDF
      • Varying number of candidates per race
      • Determining where a new race has started is not straightforward
      • (Annotated PDF page on the slide; labelled fields: district, candidate)
      • TN election results website: http://sos.tn.gov/products/elections/election-results
  • 18. Tennessee: pseudo code for TN PDF
      • Download file, read
      • Convert PDF to free-form text
      • Find separators for race, page, county
      • Determine number of races, pages, counties per race
      • Determine number of candidates per race
      • Determine number of rows and columns taken up by candidate names
      • Find number of precincts by race
      • Tokenize and compute number of words in each precinct name
      • Create list of candidates by district
      • Merge main data frame with candidates df
      • Remove unwanted rows
      • Transform and standardize into desired format
      Sample code:
          library(pdftools)  # pdf_text
          library(dplyr)     # %>%, group_by, mutate, select

          txt <- pdf_text(filename)
          #' Store the whole pdf in one dataframe of 1 column
          df <- read.csv(textConnection(txt), sep="\n", header=F, stringsAsFactors = F)

          ## Find out how many candidates per Race & how many rows for candidate names
          ## logic for num_cand is based on number of columns for vote counts
          ## example, searching for row before "COUNTY" and see 1 2 3 4...and take max
          ## logic for numrows_col1 is based on count of rows between race name
          ## & vote count column headers
          ## (the Race column and CANDIDATE_BLK_EXTRA_LINES are not shown on this slide)
          a <- df %>% group_by(Race) %>%
            mutate(key = grep("COUNTY", V1)[1]-1, #row prior to first match
                   num_cand = as.numeric(max(unlist(strsplit(V1[key], split="")), na.rm=T)),
                   numrows_col1 = key - CANDIDATE_BLK_EXTRA_LINES
                   # diff = (num_cand == numrows_col1) # catch where num of candidates
                   # is diff from extra rows between race & vote headers
            ) %>%
            select(-key)
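The “find separators” and “number of candidates per race” steps can be sketched in base R on a handful of text lines. Everything here is an assumption made for illustration: the sample lines stand in for pdf_text() output split on newlines, each vote table is assumed to start with a header line beginning with "COUNTY", and the race title is assumed to sit exactly two lines above that header. The real PDFs are far less regular.

```r
# Made-up lines standing in for one PDF's text
lines <- c(
  "United States Senate",
  "1. Jane Doe - DEM   2. John Roe - REP",
  "COUNTY   PRECINCT   1   2",
  "Anderson  Andersonville   210   305",
  "Anderson  Briceville      120    98",
  "Governor",
  "1. A. Smith - REP   2. B. Jones - DEM   3. C. Lee - IND",
  "COUNTY   PRECINCT   1   2   3",
  "Anderson  Andersonville   190   280   12"
)

# Header rows of the vote tables
header_rows <- grep("^COUNTY", lines)

# Candidates per race = largest column number printed on the header line
count_candidates <- function(header_line) {
  tokens <- strsplit(trimws(header_line), "\\s+")[[1]]
  max(as.numeric(tokens[grepl("^[0-9]+$", tokens)]))
}
num_cand <- vapply(lines[header_rows], count_candidates, numeric(1), USE.NAMES = FALSE)

# Assumed layout: the race title sits two lines above each header
race_start <- header_rows - 2
race_id <- cumsum(seq_along(lines) %in% race_start)
```

Tokenizing the header on whitespace, rather than splitting it into single characters, also copes with races that have ten or more candidates.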
  • 20. Tennessee
      • 7 candidates, listed in 2 columns, 5 rows
      • Candidate names in 4 rows, 3 columns; party handled differently
      • There is yet another example (not shown) with >10 candidates, where a single row (precinct) goes across multiple pages!
  • 21. Tennessee: wrote a bunch of helper functions (code shown on the original slide, with their input parameters highlighted)
  • 22. Tennessee: multiple lines for a candidate
      One of the many interesting challenges along the way: a candidate whose name spans multiple lines in the input PDF appears as 2 candidates in the DF!
      Sample code:
          # create new df with names of candidates by district
          c2 <- candidate_list
          candidate_list <- b %>%
            group_by(district) %>%
            slice(2:(numrows_col1 + 1)) %>%
            select(V1, district, num_cand, numrows_col1)
          clean_cand <- create_list_candidates_and_numbers(candidate_list)
          candidate_list1 <- clean_cand %>%
            separate(Candidate, c("Candidate", "party"), sep = " . ") %>%
            unite(dist_cand, district, Number, sep = "_Z_", remove = TRUE)
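A possible way to stitch wrapped candidate names back together is sketched below in base R. It is not the helper function from the talk. It assumes every candidate entry starts with a number followed by a period, so any line without that prefix is treated as a continuation of the previous entry; the names are fictional.

```r
# Made-up candidate lines; entries 1 and 2 each wrap onto a second line
raw_names <- c(
  "1. Jane Q. Public and",
  "John Doe - DEM",
  "2. Mary Major and",
  "Richard Roe - REP",
  "3. Pat Smith - IND"
)

# A new candidate starts wherever a line begins with "<number>."
starts <- grepl("^[0-9]+\\.", raw_names)
group  <- cumsum(starts)

# Paste continuation lines onto the line that opened the entry
merged <- unname(vapply(split(raw_names, group), paste, character(1), collapse = " "))
```

Here the five input lines collapse into three entries, one per numbered candidate.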
  • 24. Missouri 2016 Primary Elections (at county level): HTML
  • 25. Missouri: 100+ counties in dropdown. Note: URL doesn’t change with selections
  • 27. Missouri: convert and transform the raw table data into the desired format (office, district, candidate, party, votes). After extraction and manipulation of 100+ HTML pages, county-level (not precinct-level) data from 1 election is ready!
  • 28. Missouri: use the RSelenium package to simulate a headless browser
      • Initialize browser session
      • Go to URL
      • Select Election name from Dropdown
      • Click Choose Election button
      • Select County name from Dropdown
      • Click Submit button
      • Get HTML Data for selected Election and County
      • Process HTML and Extract Table
      • Convert to Raw Data (readHTMLTable())
      • Transform raw data into desired format
      • Repeat for all counties for that Election
      Sample code (selected_election is set elsewhere in the script):
          library(RSelenium)  # remoteDriver
          library(XML)        # htmlParse, xpathSApply, readHTMLTable
          library(dplyr)      # %>%

          remDrv <- remoteDriver(browserName = 'phantomjs') #instantiate new remoteDriver
          remDrv$open() # open method in remoteDriver class
          url <- 'http://enrarchives.sos.mo.gov/enrnet/CountyResults.aspx'

          # Simulate browser session and fill out form
          remDrv$navigate(url) #send headless browser to url

          #Select the Election from DROPDOWN using id in xpath
          elec_xp <- paste0('//*[@id="cboElectionNames"]/option[' , selected_election , ']')
          remDrv$findElement(using = "xpath", elec_xp)$clickElement() #election is set

          # ---- Click the button to select the Election
          eBTN <- '//*[@id="MainContent_btnElectionType"]'
          remDrv$findElement(using = 'xpath', eBTN)$clickElement()

          ## Get the HTML data from the page and process it using XML package
          raw <- remDrv$getPageSource()[[1]]
          counties_val <- xpathSApply(htmlParse(raw), '//*[@id="cboCounty"]/option', xmlAttrs)
          chosen_county <- grep("selected", counties_val)

          #Extract the Table (Election results)
          resTable <- raw %>% readHTMLTable()
          resDf <- resTable[[1]] # return desired data frame from list of tables
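The HTML-processing half of these steps can be tried without a browser by parsing a small inline page. This is a sketch assuming the XML package is installed; the HTML below is made up and only loosely imitates the results page (the cboCounty id is borrowed from the code above, everything else is hypothetical).

```r
library(XML)

# Made-up page source standing in for remDrv$getPageSource()[[1]]
html <- '<html><body>
<select id="cboCounty">
  <option value="1">Adair</option>
  <option value="2" selected="selected">Andrew</option>
</select>
<table>
  <tr><th>Office</th><th>Candidate</th><th>Party</th><th>Votes</th></tr>
  <tr><td>Governor</td><td>Jane Doe</td><td>DEM</td><td>1,234</td></tr>
  <tr><td>Governor</td><td>John Roe</td><td>REP</td><td>987</td></tr>
</table>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Which county is currently selected in the dropdown?
county <- xpathSApply(doc, '//*[@id="cboCounty"]/option[@selected]', xmlValue)

# First table on the page as a data frame, using the first row as column names
res <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)[[1]]

# Vote counts arrive as text with thousands separators
res$Votes  <- as.numeric(gsub(",", "", res$Votes))
res$county <- county
```

Attaching the selected county to every row matters because the page URL does not change between counties, so the county name has to come from the page itself.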
  • 29. Conclusions & Takeaways
      • Great way to learn and contribute
      • pdftools: good package for extracting text data from PDFs
      • tabulizer: useful package for extracting tabular data from PDFs
      • RSelenium, XML: great packages for web-scraping with (simulating) forms
      • Lots of work still needs to be done for recent elections (2000-2016) across all states
      • 50 states, 100s of input files in a variety of formats per state
      • Meaningful analysis can be done by data scientists once data is available
      • Presidential election results get a lot of attention, but other races are arguably as important