Contributing to OpenElections
(Open Data) Using R
Rupal Agrawal
BARUG Meetup
February 2017
Background - Election data in the US
• Election results are not reported by any single federal agency
• Instead, each state and county reports results in a variety of formats
(HTML, PDF, CSV), often with very different layouts and varying
levels of granularity
• There are many elections besides the Presidential: primaries for
each party, mid-term and special elections for various offices (US Senate,
US House, State legislatures, Governor, etc.)
• There is no freely available, comprehensive source of official
election results that people can use for analysis or journalists for
reporting
• Article: “Elections: The final frontier of open data?”
• https://sunlightfoundation.com/2015/02/27/elections-the-final-frontier-of-open-data/
About OpenElections
• Goal of this Open Data effort “to create the first free,
comprehensive, standardized, linked set of election data for
the United States, including federal, statewide and state
legislative offices”
• Website openelections.net (not current, need volunteers)
• docs.openelections.net (instructions to contribute)
• GitHub page (updated regularly)
  • Contains latest work in progress
  • Separate repo for each state
  • Processed data by year, election
  • Instructions for contributors
  • Contributed code/scripts mainly in Python
  • Issue tracking
@openelex
Motivation
• I have been volunteering with OpenElections to help create
such a source
• I use R to automate some of these tasks: web-scraping, PDF
conversion and data manipulation to produce the desired
outputs in a consistent format
• In this lightning talk, using real examples from multiple US
states, I will highlight some of the challenges I faced
• I will also share some of the R packages I used (RSelenium,
XML, pdftools, tabulizer, dplyr, tidyr, data.table) to help
others wishing to volunteer with similar Open Data efforts
Desired output format (csv)
1. County
2. Precinct (if available)
3. Office (President, U.S. Senate, U.S. House, State
Senate, State House, Attorney General, etc)
4. District (# for U.S. House district or State
Senate or State House district)
5. Party (DEM, REP, LIB, OTH…)
6. Candidate (names of candidates)
7. Votes (# of votes received)
OpenElections specifies a standardized format for the desired output
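As a rough illustration of the seven-column layout listed above, here is a minimal base R sketch. The lowercase column names, the sample rows and the temporary file are assumptions made for the example, not values taken from an actual OpenElections file.

```r
# Hypothetical results for one precinct, shaped like the desired output
results <- data.frame(
  county    = c("Adair", "Adair"),
  precinct  = c("1 NW", "1 NW"),
  office    = c("President", "President"),
  district  = c(NA, NA),                      # no district for a statewide race
  party     = c("DEM", "REP"),
  candidate = c("Candidate A", "Candidate B"),
  votes     = c(120L, 95L),
  stringsAsFactors = FALSE
)

# Check the column order before writing (names are an assumed convention)
required_cols <- c("county", "precinct", "office", "district",
                   "party", "candidate", "votes")
stopifnot(identical(names(results), required_cols))

# Write to a temporary csv; empty strings stand in for missing districts
out_file <- tempfile(fileext = ".csv")
write.csv(results, out_file, row.names = FALSE, na = "")
```

Checking the column names up front catches a missing or misordered column before a file is contributed.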
Let’s take a look at 4 US states
IOWA, ALASKA, TENNESSEE, MISSOURI
IOWA
Iowa 2016 General Election, all races at precinct level (txt file)
Let’s start with an easy case
Sample input data is available as a text file in wide format, so only data manipulation is needed
(shown in Excel on the slide for ease of reading)
• 5000+ columns, one per county + precinct + vote type
• Each row is a race (office + district)
> length(unique(long_DF$RaceTitle))
[1] 197
Convert the wide file to a long file using the tidyr package (gather command) so that each countyprecinct is in a separate row
long_DF <- df %>% gather(countyprecinct, Votes, c(4:5119))
Sample relevant commands (actual code is more elaborate)
Challenges along the way
• Countyprecinct in the input file was separated by “-”, but precinct names also contained “-”
• Absentee, Polling & Total votes needed to be retained
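Split combined columns like RaceTitle and countyprecinct into individual columns
# sep is a regular expression, so the dot is escaped to match a literal period
separate_DF <- long_DF %>% separate(RaceTitle, c("office", "district"), sep = " Dist\\. ")
# "ZZZZ" appears to be a placeholder separator standing in for the ambiguous "-"
separate_DF <- separate_DF %>% separate(countyprecinct, c("county", "precinct"), sep = "ZZZZ")
# put total, absentee and polling votes side by side
cbind(outputT, outputAbs$absentee_votes, outputP$polling_votes)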
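A small end-to-end sketch of the wide-to-long conversion and the hyphen problem, using a made-up two-race data frame. The column names, precinct names and vote counts are hypothetical. It also assumes that county names themselves contain no “-”, so only the first “-” separates county from precinct. That first hyphen is swapped for the “ZZZZ” placeholder before splitting.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide input: one row per race, one column per county-precinct
df <- data.frame(
  RaceTitle          = c("President", "State Senate Dist. 12"),
  CandidateName      = c("Candidate A", "Candidate B"),
  PoliticalPartyName = c("DEM", "REP"),
  `Adair-1 NW`         = c(10, 20),
  `Polk-Des Moines-14` = c(30, 40),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Wide to long: one row per race and county-precinct
long_DF <- df %>%
  gather(countyprecinct, Votes, 4:5)

separate_DF <- long_DF %>%
  # replace only the FIRST hyphen, so hyphens inside precinct names survive
  mutate(countyprecinct = sub("-", "ZZZZ", countyprecinct, fixed = TRUE)) %>%
  separate(countyprecinct, c("county", "precinct"), sep = "ZZZZ") %>%
  # races without a district (e.g. President) get NA in the district column
  separate(RaceTitle, c("office", "district"), sep = " Dist\\. ", fill = "right")
```

gather() is used here to match the slides. In current tidyr, pivot_longer() is the recommended replacement and could be substituted.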
ALASKA
Another sample file: Alaska 2016 General Election at precinct level (data manipulation only)
Like the IA file shown earlier, this is also a csv file, but the layout is
different and new custom code is needed to process it
Even the same state changes the format and layout of results from one election to the next
Alaska 2012 General Election results in csv are only at district level
(with a different layout and columns from 2016)
To get precinct-level results, one needs to process
40 PDFs, one for each county (district)
Used
• pdftools package: pdf_text
• tabulizer package: extract_tables, extract_text
Abandoned after trying a variety of ways to find a pattern consistent enough across
multiple pages and files to extract the data via a script
TENNESSEE
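Tennessee 2004 General Elections
[Screenshot of the input PDF, annotated with the fields to extract: votes, office, candidate, party (including candidates with party=OTH), county, precinct]
Results for a single election are spread across 4 distinct PDFs, each with dozens of pages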
• Multiple races in a single PDF
• Varying number of candidates per race
• Determining where a new race has started is not straightforward
[Screenshot of the input PDF, annotated with: district, candidate]
TN election results website: http://sos.tn.gov/products/elections/election-results
Pseudo code for TN PDF
• Download file, read
• Convert PDF to free-form text
• Find separators for race, page, county
• Determine number of races, pages, counties per race
• Determine number of candidates per race
• Determine number of rows and columns taken up by
candidate names
• Find number of precincts by race
• Tokenize and Compute number of words in each
precinct name
• Create list of candidates by district
• Merge main data frame with candidates df
• Remove unwanted rows
• Transform and standardize into desired format
library(pdftools)  # pdf_text()
library(dplyr)     # %>%, group_by(), mutate(), select()

txt <- pdf_text(filename)
# Store the whole PDF in one data frame of 1 column (V1), one row per line of text
df <- read.csv(textConnection(txt), sep = "\n", header = FALSE,
               stringsAsFactors = FALSE)
## Find out how many candidates per Race & how many rows for candidate names
## logic for num_cand is based on number of columns for vote counts
## example, searching for row before "COUNTY" and see 1 2 3 4...and take max
## logic for numrows_col1 is based on count of rows between race name
## & vote count column headers
## (the Race column and CANDIDATE_BLK_EXTRA_LINES come from steps not shown here)
a <- df %>%
  group_by(Race) %>%
  mutate(key = grep("COUNTY", V1)[1] - 1,  # row prior to first match
         num_cand = as.numeric(max(unlist(strsplit(V1[key], split = "")),
                                   na.rm = TRUE)),
         numrows_col1 = key - CANDIDATE_BLK_EXTRA_LINES,
         diff = (num_cand == numrows_col1)  # catch where num of candidates
                                            # is diff from extra rows between race & vote headers
  ) %>%
  select(-key)
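The code above groups by a Race column that is created in a step not shown. One way such a column might be derived is sketched below in base R, on made-up text lines. The rule that a race title sits two lines above each "COUNTY" header is an assumption chosen for the example and is not taken from the actual Tennessee PDFs. Note that the candidate count is computed by splitting on whitespace, a variation on the character-by-character split above, so that counts of 10 or more would also work.

```r
# Hypothetical lines of text, as if extracted from a PDF page
lines <- c(
  "UNITED STATES PRESIDENT",
  "1 2",
  "COUNTY PRECINCT",
  "Anderson Precinct-1 100 200",
  "STATE SENATE DISTRICT 4",
  "1 2 3",
  "COUNTY PRECINCT",
  "Anderson Precinct-1 10 20 30"
)
df <- data.frame(V1 = lines, stringsAsFactors = FALSE)

# Assumed layout: race title, then candidate numbers, then the COUNTY header
county_rows <- grep("^COUNTY", df$V1)
race_rows   <- county_rows - 2

# Label every row with the most recent race title (NA before the first race)
is_race_start <- seq_len(nrow(df)) %in% race_rows
race_id <- cumsum(is_race_start)            # 0, then 1, 2, ... per race
df$Race <- c(NA, df$V1[race_rows])[race_id + 1]

# Candidates per race: largest number on the line just above each COUNTY header
num_cand <- sapply(county_rows, function(i) {
  max(as.numeric(strsplit(trimws(df$V1[i - 1]), " +")[[1]]))
})
```

With the Race column in place, per-race logic such as the group_by(Race) step can be applied to each block of rows.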
[Screenshots: initial DF and intermediate DF]
• 7 candidates, listed in 2 columns, 5 rows
• Candidate names in 4 rows, 3 columns; party handled differently
• There is yet another example (not shown) with >10 candidates, where a single row (precinct) goes across multiple pages!
Wrote a bunch of helper functions for these cases (shown on the slide with their input parameters)
One of the many interesting challenges along the way: multiple lines for a candidate
A candidate that takes up multiple lines in the input PDF appears as 2 candidates in the DF!
Sample code
# create new df with names of candidates by district
c2 <- candidate_list
candidate_list <- b %>%
  group_by(district) %>%
  slice(2:(numrows_col1 + 1)) %>%
  select(V1, district, num_cand, numrows_col1)
# create_list_candidates_and_numbers() is one of the helper functions (not shown)
clean_cand <- create_list_candidates_and_numbers(candidate_list)
candidate_list1 <- clean_cand %>%
  separate(Candidate, c("Candidate", "party"),
           sep = " . ") %>%
  unite(dist_cand, district, Number,
        sep = "_Z_", remove = TRUE)
MISSOURI
Missouri 2016 Primary Elections (at county level) - HTML
• 100+ counties in dropdown
• Note: URL doesn’t change with selections
• Each results page provides: county, office, district, candidate, party, votes
Convert and transform raw table data into the desired format
After extracting and manipulating 100+ HTML pages, county-level (not precinct-level)
data from 1 election is ready!
Use the RSelenium package to drive a headless browser
• Initialize browser session
• Go to URL
• Select Election name from dropdown
• Click Choose Election button
• Select County name from dropdown
• Click Submit button
• Get HTML data for selected Election and County
• Process HTML and extract table
• Convert to raw data (readHTMLTable())
• Transform raw data into desired format
• Repeat for all counties for that Election
library(RSelenium)  # remoteDriver()
library(XML)        # htmlParse(), xpathSApply(), xmlAttrs(), readHTMLTable()
library(dplyr)      # %>%

remDrv <- remoteDriver(browserName = 'phantomjs')  # instantiate new remoteDriver
remDrv$open()  # open method in remoteDriver class
url <- 'http://enrarchives.sos.mo.gov/enrnet/CountyResults.aspx'
# Simulate browser session and fill out form
remDrv$navigate(url)  # send headless browser to url
# Select the Election from DROPDOWN using id in xpath
# (selected_election is the position of the desired election in the dropdown)
elec_xp <- paste0('//*[@id="cboElectionNames"]/option[', selected_election, ']')
remDrv$findElement(using = "xpath", elec_xp)$clickElement()
# election is set
# ---- Click the button to select the Election
eBTN <- '//*[@id="MainContent_btnElectionType"]'
remDrv$findElement(using = 'xpath', eBTN)$clickElement()

## Get the HTML data from the page and process it using XML package
raw <- remDrv$getPageSource()[[1]]
counties_val <- xpathSApply(htmlParse(raw),
                            '//*[@id="cboCounty"]/option', xmlAttrs)
chosen_county <- grep("selected", counties_val)
# Extract the Table (Election results)
resTable <- raw %>% readHTMLTable()
resDf <- resTable[[1]]  # return desired data frame from list of tables
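The table-extraction step can be tried without a browser session. The sketch below parses a small, made-up HTML string with the XML package and cleans the vote counts. The table contents and column names are hypothetical and do not reflect the actual layout of the Missouri results pages.

```r
library(XML)

# Hypothetical page source, standing in for remDrv$getPageSource()[[1]]
raw <- '<html><body><table>
<tr><th>Candidate</th><th>Party</th><th>Votes</th></tr>
<tr><td>Candidate A</td><td>DEM</td><td>1,234</td></tr>
<tr><td>Candidate B</td><td>REP</td><td>987</td></tr>
</table></body></html>'

# Parse the string as HTML and read every <table> into a list of data frames
doc <- htmlParse(raw, asText = TRUE)
resTable <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)
resDf <- resTable[[1]]

# Vote counts arrive as text; drop thousands separators before converting
resDf$Votes <- as.numeric(gsub(",", "", resDf$Votes))
```

Keeping the parsing separate from the browser automation makes it easier to test the transformation on a saved page before looping over 100+ counties.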
Conclusions & Takeaways
• Great way to learn and contribute
• pdftools: good package for extracting text data from PDFs
• tabulizer: useful package for extracting tabular data from PDFs
• RSelenium, XML: great packages for web-scraping pages that require simulating form input
• Lots of work still needs to be done for recent elections (2000-2016) across
all states
• 50 states, 100s of input files in a variety of formats per state
• Meaningful analysis can be done by data scientists once data is available
• Presidential election results get a lot of attention, but other races are
arguably as important
Questions?
Rupal Agrawal
rupal_agrawal@yahoo.com
Info on OpenElections:
@openelex
docs.openelections.net
https://github.com/openelections

More Related Content

Viewers also liked

Open Local Data Presentation
Open Local Data PresentationOpen Local Data Presentation
Open Local Data PresentationChris Taggart
 
Previous project Statistics ,Sponsors and Judges
Previous project Statistics ,Sponsors and JudgesPrevious project Statistics ,Sponsors and Judges
Previous project Statistics ,Sponsors and Judgessathishkumar supermaniam
 
Group assigment statistic group3
Group assigment statistic group3Group assigment statistic group3
Group assigment statistic group3Narith Por
 
Produccion y desarrollo sustentable 2A
Produccion y desarrollo sustentable 2AProduccion y desarrollo sustentable 2A
Produccion y desarrollo sustentable 2AGaelmontano41
 
Blame 032
Blame 032Blame 032
Blame 032comicgo
 
Effect of Salinity on Phenolic Composition and Antioxidant Activity of Artich...
Effect of Salinity on Phenolic Composition and Antioxidant Activity of Artich...Effect of Salinity on Phenolic Composition and Antioxidant Activity of Artich...
Effect of Salinity on Phenolic Composition and Antioxidant Activity of Artich...Amir Rezazadeh
 
Enrichment statistics project
Enrichment statistics projectEnrichment statistics project
Enrichment statistics projectElizabeth Walker
 
Master veille informationnelle_2016-2017_partie2
Master veille informationnelle_2016-2017_partie2Master veille informationnelle_2016-2017_partie2
Master veille informationnelle_2016-2017_partie2Jean-Paul Thomas
 
Master veille informationnelle_2016-2017_partie4
Master veille informationnelle_2016-2017_partie4Master veille informationnelle_2016-2017_partie4
Master veille informationnelle_2016-2017_partie4Jean-Paul Thomas
 
Ap statistics final project
Ap statistics final projectAp statistics final project
Ap statistics final projecteseuwhu1
 
Programazio didaktikoak lh eta dbh
Programazio didaktikoak lh eta dbhProgramazio didaktikoak lh eta dbh
Programazio didaktikoak lh eta dbhtrutxete
 
Statistic project 22
Statistic project 22Statistic project 22
Statistic project 22Jenny Lee
 
Sawdust Art Festival - Marketing Communication Proposal.
Sawdust Art Festival - Marketing Communication Proposal.Sawdust Art Festival - Marketing Communication Proposal.
Sawdust Art Festival - Marketing Communication Proposal.Bill Barrick
 

Viewers also liked (17)

Open Local Data Presentation
Open Local Data PresentationOpen Local Data Presentation
Open Local Data Presentation
 
Previous project Statistics ,Sponsors and Judges
Previous project Statistics ,Sponsors and JudgesPrevious project Statistics ,Sponsors and Judges
Previous project Statistics ,Sponsors and Judges
 
Group assigment statistic group3
Group assigment statistic group3Group assigment statistic group3
Group assigment statistic group3
 
Produccion y desarrollo sustentable 2A
Produccion y desarrollo sustentable 2AProduccion y desarrollo sustentable 2A
Produccion y desarrollo sustentable 2A
 
Blame 032
Blame 032Blame 032
Blame 032
 
Effect of Salinity on Phenolic Composition and Antioxidant Activity of Artich...
Effect of Salinity on Phenolic Composition and Antioxidant Activity of Artich...Effect of Salinity on Phenolic Composition and Antioxidant Activity of Artich...
Effect of Salinity on Phenolic Composition and Antioxidant Activity of Artich...
 
Alba Resumé
Alba ResuméAlba Resumé
Alba Resumé
 
Enrichment statistics project
Enrichment statistics projectEnrichment statistics project
Enrichment statistics project
 
Master veille informationnelle_2016-2017_partie2
Master veille informationnelle_2016-2017_partie2Master veille informationnelle_2016-2017_partie2
Master veille informationnelle_2016-2017_partie2
 
Master veille informationnelle_2016-2017_partie4
Master veille informationnelle_2016-2017_partie4Master veille informationnelle_2016-2017_partie4
Master veille informationnelle_2016-2017_partie4
 
Jorge Enrique Adoum
Jorge Enrique AdoumJorge Enrique Adoum
Jorge Enrique Adoum
 
Ap statistics final project
Ap statistics final projectAp statistics final project
Ap statistics final project
 
Programazio didaktikoak lh eta dbh
Programazio didaktikoak lh eta dbhProgramazio didaktikoak lh eta dbh
Programazio didaktikoak lh eta dbh
 
Statistic project 22
Statistic project 22Statistic project 22
Statistic project 22
 
100 mambos and merengues
100 mambos and merengues100 mambos and merengues
100 mambos and merengues
 
Sawdust Art Festival - Marketing Communication Proposal.
Sawdust Art Festival - Marketing Communication Proposal.Sawdust Art Festival - Marketing Communication Proposal.
Sawdust Art Festival - Marketing Communication Proposal.
 
ABBA Gold
ABBA GoldABBA Gold
ABBA Gold
 

Similar to 2017 Contributing to Open Elections Data using R

Election Project (Elep)
Election Project (Elep)Election Project (Elep)
Election Project (Elep)datamap.io
 
Analysis of us presidential elections, 2016
  Analysis of us presidential elections, 2016  Analysis of us presidential elections, 2016
Analysis of us presidential elections, 2016Tapan Saxena
 
Election Project (ELEP)
Election Project (ELEP)Election Project (ELEP)
Election Project (ELEP)datamap.io
 
Help! Webinar: "Making Election Data Great Again"
Help! Webinar: "Making Election Data Great Again"Help! Webinar: "Making Election Data Great Again"
Help! Webinar: "Making Election Data Great Again"Lynda Kellam
 
Final%20Analysis%20Code%20Displayed.html
Final%20Analysis%20Code%20Displayed.htmlFinal%20Analysis%20Code%20Displayed.html
Final%20Analysis%20Code%20Displayed.htmlRyan Haeri
 
Serbia Presentation Final
Serbia Presentation FinalSerbia Presentation Final
Serbia Presentation Finalgoptech
 
Business Journalism Professors 2014: Excel for Journalists by Steve Doig
Business Journalism Professors 2014: Excel for Journalists by Steve DoigBusiness Journalism Professors 2014: Excel for Journalists by Steve Doig
Business Journalism Professors 2014: Excel for Journalists by Steve DoigReynolds Center for Business Journalism
 
Elections 2013
Elections 2013Elections 2013
Elections 2013lndata
 
Voter Management System Report - Tejas Agarwal
Voter Management System Report - Tejas AgarwalVoter Management System Report - Tejas Agarwal
Voter Management System Report - Tejas AgarwalTejas Garodia
 
Reviewing basic concepts of relational database
Reviewing basic concepts of relational databaseReviewing basic concepts of relational database
Reviewing basic concepts of relational databaseHitesh Mohapatra
 
Using Clojure to Marry Neo4j and Open Democracy
Using Clojure to Marry Neo4j and Open DemocracyUsing Clojure to Marry Neo4j and Open Democracy
Using Clojure to Marry Neo4j and Open DemocracyDavid Simons
 
What is my neighbourhood like: Data collecting
What is my neighbourhood like: Data collectingWhat is my neighbourhood like: Data collecting
What is my neighbourhood like: Data collectingAmarni Wood
 
Part 1 Individual Factors Affecting Voter Turnout Based on .docx
Part 1 Individual Factors Affecting Voter Turnout Based on .docxPart 1 Individual Factors Affecting Voter Turnout Based on .docx
Part 1 Individual Factors Affecting Voter Turnout Based on .docxdanhaley45372
 
Introduction to Database Concepts
Introduction to Database ConceptsIntroduction to Database Concepts
Introduction to Database ConceptsRosalyn Lemieux
 

Similar to 2017 Contributing to Open Elections Data using R (20)

Election Project (Elep)
Election Project (Elep)Election Project (Elep)
Election Project (Elep)
 
Analysis of us presidential elections, 2016
  Analysis of us presidential elections, 2016  Analysis of us presidential elections, 2016
Analysis of us presidential elections, 2016
 
Election Project (ELEP)
Election Project (ELEP)Election Project (ELEP)
Election Project (ELEP)
 
Help! Webinar: "Making Election Data Great Again"
Help! Webinar: "Making Election Data Great Again"Help! Webinar: "Making Election Data Great Again"
Help! Webinar: "Making Election Data Great Again"
 
Final%20Analysis%20Code%20Displayed.html
Final%20Analysis%20Code%20Displayed.htmlFinal%20Analysis%20Code%20Displayed.html
Final%20Analysis%20Code%20Displayed.html
 
Serbia Presentation Final
Serbia Presentation FinalSerbia Presentation Final
Serbia Presentation Final
 
Voterfiletrainingpacket
VoterfiletrainingpacketVoterfiletrainingpacket
Voterfiletrainingpacket
 
Intro to open refine
Intro to open refineIntro to open refine
Intro to open refine
 
Excel for Journalists by Steve Doig
Excel for Journalists by Steve DoigExcel for Journalists by Steve Doig
Excel for Journalists by Steve Doig
 
Business Journalism Professors 2014: Excel for Journalists by Steve Doig
Business Journalism Professors 2014: Excel for Journalists by Steve DoigBusiness Journalism Professors 2014: Excel for Journalists by Steve Doig
Business Journalism Professors 2014: Excel for Journalists by Steve Doig
 
Elections 2013
Elections 2013Elections 2013
Elections 2013
 
Voter Management System Report - Tejas Agarwal
Voter Management System Report - Tejas AgarwalVoter Management System Report - Tejas Agarwal
Voter Management System Report - Tejas Agarwal
 
2016-05-14 g0v Summit
2016-05-14 g0v Summit2016-05-14 g0v Summit
2016-05-14 g0v Summit
 
Reviewing basic concepts of relational database
Reviewing basic concepts of relational databaseReviewing basic concepts of relational database
Reviewing basic concepts of relational database
 
Using Clojure to Marry Neo4j and Open Democracy
Using Clojure to Marry Neo4j and Open DemocracyUsing Clojure to Marry Neo4j and Open Democracy
Using Clojure to Marry Neo4j and Open Democracy
 
What is my neighbourhood like
What is my neighbourhood likeWhat is my neighbourhood like
What is my neighbourhood like
 
What is my neighbourhood like: Data collecting
What is my neighbourhood like: Data collectingWhat is my neighbourhood like: Data collecting
What is my neighbourhood like: Data collecting
 
Part 1 Individual Factors Affecting Voter Turnout Based on .docx
Part 1 Individual Factors Affecting Voter Turnout Based on .docxPart 1 Individual Factors Affecting Voter Turnout Based on .docx
Part 1 Individual Factors Affecting Voter Turnout Based on .docx
 
Introduction to Database Concepts
Introduction to Database ConceptsIntroduction to Database Concepts
Introduction to Database Concepts
 
Political Poster Edit
Political Poster EditPolitical Poster Edit
Political Poster Edit
 

Recently uploaded

DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxFurkanTasci3
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 

Recently uploaded (20)

DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 

2017 Contributing to Open Elections Data using R

  • 1. Contributing to OpenElections (Open Data) Using R Rupal Agrawal BARUG Meetup February 2017
  • 2. Background - Election data in the US • Election results are not reported by any single federal agency • Instead, each state & county reports in a variety of formats -- HTML, PDF, CSV, often with very different layouts and varying levels of granularity • Number of elections, besides the Presidential – primaries for each party, mid-term and special for various offices (US Senate, US House, State legislatures, Governor, etc.) • There is no freely available comprehensive source of official election results, for people to use for analysis or journalists for reporting • Article: “Elections: The final frontier of open data?” • https://sunlightfoundation.com/2015/02/27/elections-the-final-frontier-of-open-data/ 2
  • 3. About OpenElections • Goal of this Open Data effort “to create the first free, comprehensive, standardized, linked set of election data for the United States, including federal, statewide and state legislative offices” • Website openelections.net (not current, need volunteers) • docs.openelections.net (instructions to contribute) • Github Page (updated regularly) • Contains latest work in progress • Separate repo for each state • Processed data by year, election • Instructions for contributors • Contributed code/scripts mainly in Python • Issue tracking 3 @openelex
  • 4. Motivation • I have been volunteering with OpenElections towards creating such a source • I use R to automate some of these tasks - web-scraping, PDF conversion and for data manipulation to produce the desired outputs in a consistent format • In this lightning talk, using real examples from multiple US states, I will highlight some of the challenges I faced • I will also share some of the R packages I used – RSelenium, XML, pdftools, tabulizer, dplyr, tidyr, data.table aimed to help others wishing to volunteer with similar Open Data efforts 4
  • 5. Desired output format (csv) 1. County 2. Precinct (if available) 3. Office (President, U.S. Senate, U.S. House, State Senate, State House, Attorney General, etc) 4. District (# for U.S. House district or State Senate or State House district) 5. Party (DEM, REP, LIB, OTH…) 6. Candidate (names of candidates) 7. Votes (# of votes received) OpenElections specifies a standardized format for the desired output 5 Output Format
  • 6. Let’s take a look at 4 US States IOWA ALASKA TENNESSEE MISSOURI
  • 8. Iowa 2016 General Election all Races at Precinct-level (txt file) Let’s start with an easy case Sample Input Data is available as text file in Wide format - Data Manipulation only (shown below in Excel for ease of reading)  length(unique(long_DF$RaceTitle))  [1] 197 5000+ columns county+precinct+votetypeoffice+district 8 IOWA
  • 9. Convert Wide file to long file using tidyR package (gather command) so each countyprecinct is in a separate row long_DF <- df %>% gather(countyprecinct, Votes, c(4:5119)) Sample relevant commands (actual code is more elaborate) Challenges along the way • Countyprecinct in input file was separated by “-” but precinct names also contained “-” • Absentee, Polling & Total votes needed to be retained 9 IOWA
  • 10. IOWA: Split combined columns like RaceTitle and countyprecinct into individual columns:
      separate_DF <- long_DF %>% separate(RaceTitle, c("office", "district"), sep = " Dist. ")
      separate_DF %>% separate(countyprecinct, c("county", "precinct"), sep = "ZZZZ")
      cbind(outputT, outputAbs$absentee_votes, outputP$polling_votes)
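One way to deal with precinct names that themselves contain “-” is sketched below. It assumes that the first “-” is always the county/precinct boundary; swapping that first hyphen for a placeholder is consistent with the "ZZZZ" separator in the sample code, but whether the original script did exactly this is a guess. The data is invented.

```r
library(tidyr)

# Hypothetical long-format rows
long <- data.frame(
  RaceTitle      = c("State Senate Dist. 2", "State Rep Dist. 10"),
  countyprecinct = c("Adair-1 NW", "Polk-Ward 1-A"),
  Votes          = c(10, 20),
  stringsAsFactors = FALSE
)

# sub() replaces only the first match, so hyphens inside the precinct
# name ("Ward 1-A") are left alone.
long$countyprecinct <- sub("-", "ZZZZ", long$countyprecinct, fixed = TRUE)

# Split office/district on the literal " Dist. " and county/precinct on
# the placeholder.
sep_df <- separate(long, RaceTitle, c("office", "district"), sep = " Dist\\. ")
sep_df <- separate(sep_df, countyprecinct, c("county", "precinct"), sep = "ZZZZ")
print(sep_df)
```

The placeholder only needs to be a string that never occurs in a real county or precinct name.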
  • 12. ALASKA: Another sample file: Alaska 2016 General Election at precinct level (data manipulation only). Like the IA file shown earlier, this is also a csv file, but the layout is different and new custom code is needed to process it
  • 13. ALASKA: Even the same state changes the format and layout of its results from one election to the next. Alaska 2012 General Election results in csv are only at district level (with a different layout and columns from 2016). To get precinct-level results, one needs to process 40 PDFs – one for each county (district)
  • 14. ALASKA: Used • pdftools package - pdf_text • tabulizer package - extract_tables, extract_text. Abandoned after trying a variety of ways to find a pattern consistent enough across multiple pages and files to extract the data with a script
  • 16. TENNESSEE: Tennessee 2004 General Elections. Results for 1 single election are available in 4 distinct PDFs, each with dozens of pages. [Slide image: annotated PDF page with labels votes, office, candidate, party, county, precinct, candidate, party=OTH]
  • 17. TENNESSEE: Multiple races in a single PDF • Varying number of candidates per race • Determining where a new race has started is not straightforward • TN election results website: http://sos.tn.gov/products/elections/election-results [Slide image: annotated PDF page with labels district, candidate]
  • 18. TENNESSEE: Pseudo code for TN PDF • Download file, read • Convert PDF to free-form text • Find separators for race, page, county • Determine number of races, pages, counties per race • Determine number of candidates per race • Determine number of rows and columns taken up by candidate names • Find number of precincts by race • Tokenize and compute number of words in each precinct name • Create list of candidates by district • Merge main data frame with candidates df • Remove unwanted rows • Transform and standardize into desired format
    Sample code:
      library(pdftools)  # pdf_text()
      library(dplyr)     # %>%, group_by(), mutate(), select()
      txt <- pdf_text(filename)
      #' Store the whole pdf in one dataframe of 1 column
      df <- read.csv(textConnection(txt), sep = "\n", header = F, stringsAsFactors = F)
      ## Find out how many candidates per Race & how many rows for candidate names
      ## logic for num_cand is based on number of columns for vote counts
      ## example, searching for row before "COUNTY" and see 1 2 3 4...and take max
      ## logic for numrows_col1 is based on count of rows between race name
      ## & vote count column headers
      ## (df is assumed to have a Race column by this point; that step is not shown)
      a <- df %>%
        group_by(Race) %>%
        mutate(key = grep("COUNTY", V1)[1] - 1,  # row prior to first match
               num_cand = as.numeric(max(unlist(strsplit(V1[key], split = "")), na.rm = T)),
               numrows_col1 = key - CANDIDATE_BLK_EXTRA_LINES
               # diff = (num_cand == numrows_col1) # catch where num of candidates
               # is diff from extra rows between race & vote headers
        ) %>%
        select(-key)
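The "row before COUNTY" idea can be tried on a few lines of text without a PDF. The page layout below is invented for illustration and is not copied from the Tennessee files. As a variation on the sample code, this sketch splits the numbering row on whitespace rather than into single characters, so that candidate numbers of 10 or more would also be read correctly.

```r
# Hypothetical text lines for one race, as they might come out of pdf_text()
page <- c(
  "State Senate District 5",
  "1. Candidate A - DEM",
  "2. Candidate B - REP",
  "3. Candidate C - IND",
  "   1    2    3",
  "COUNTY   PRECINCT",
  "Example County   Precinct A   10   20   5"
)

# Row just before the first "COUNTY" header holds the candidate column numbers
key <- grep("COUNTY", page)[1] - 1

# Split that row on runs of spaces and take the largest number
col_numbers <- as.numeric(strsplit(trimws(page[key]), " +")[[1]])
num_cand <- max(col_numbers, na.rm = TRUE)

print(num_cand)
```

In the grouped version on the slide the same logic runs once per race, with the row index relative to the start of each race block.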
  • 20. TENNESSEE: 7 candidates, listed in 2 columns, 5 rows • Candidate names in 4 rows, 3 columns • Party handled differently • There is yet another example (not shown) with >10 candidates, where a single row (precinct) runs across multiple pages!
  • 21. TENNESSEE: Wrote a bunch of helper functions like the ones shown on the slide, along with their input parameters
  • 22. TENNESSEE: One of the many interesting challenges along the way: multiple lines for a candidate. A single candidate in the input PDF appears as 2 candidates in the DF!
    Sample code:
      # create new df with names of candidates by district
      c2 <- candidate_list
      candidate_list <- b %>%
        group_by(district) %>%
        slice(2:(numrows_col1 + 1)) %>%
        select(V1, district, num_cand, numrows_col1)
      clean_cand <- create_list_candidates_and_numbers(candidate_list)
      candidate_list1 <- clean_cand %>%
        separate(Candidate, c("Candidate", "party"), sep = " . ") %>%
        unite(dist_cand, district, Number, sep = "_Z_", remove = TRUE)
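A minimal base R sketch of one way to stitch a wrapped candidate name back together. It assumes, hypothetically, that every candidate entry begins with a number followed by a period, so any line without that prefix is treated as a continuation of the previous entry. This is not the helper function used in the talk, and the names are invented.

```r
# Hypothetical candidate lines; the second candidate wraps onto a new line
raw_names <- c(
  "1. Candidate A - DEM",
  "2. Candidate With A Very Long",
  "Name - REP",
  "3. Candidate C - IND"
)

# TRUE where a new candidate entry starts
starts <- grepl("^[0-9]+\\.", raw_names)

# Running count of starts gives each line the id of the entry it belongs to
entry_id <- cumsum(starts)

# Paste together all lines that share an entry id
merged <- unname(vapply(split(raw_names, entry_id), paste,
                        character(1), collapse = " "))
print(merged)
```

If the real PDFs number candidates differently, only the pattern passed to grepl() would need to change.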
  • 24. MO: Missouri 2016 Primary Elections (at county level) - HTML
  • 25. MO: 100+ counties in dropdown. Note: URL doesn’t change with selections
  • 27. MO: Convert and transform the raw table data into the desired format (office, candidate, party, votes, district). After extracting and manipulating 100+ html pages, county-level (not precinct-level) data from 1 election is ready!
  • 28. MO: Use RSelenium package to simulate headless browser • Initialize browser session • Go to URL • Select Election name from Dropdown • Click Choose Election button • Select County name from Dropdown • Click Submit button • Get HTML Data for selected Election and County • Process HTML and Extract Table • Convert to Raw Data (readHTMLTable()) • Transform raw data into desired format • Repeat for all counties for that Election
    Sample code:
      library(RSelenium)  # remoteDriver()
      library(XML)        # htmlParse(), xpathSApply(), readHTMLTable()
      library(magrittr)   # %>%
      remDrv <- remoteDriver(browserName = 'phantomjs')  # instantiate new remoteDriver
      remDrv$open()  # open method in remoteDriver class
      url <- 'http://enrarchives.sos.mo.gov/enrnet/CountyResults.aspx'
      # Simulate browser session and fill out form
      remDrv$navigate(url)  # send headless browser to url
      # Select the Election from DROPDOWN using id in xpath
      elec_xp <- paste0('//*[@id="cboElectionNames"]/option[', selected_election, ']')
      remDrv$findElement(using = "xpath", elec_xp)$clickElement()  # election is set
      # ---- Click the button to select the Election
      eBTN <- '//*[@id="MainContent_btnElectionType"]'
      remDrv$findElement(using = 'xpath', eBTN)$clickElement()
      # (county selection and Submit click are not shown on the slide)
      ## Get the HTML data from the page and process it using XML package
      raw <- remDrv$getPageSource()[[1]]
      counties_val <- xpathSApply(htmlParse(raw), '//*[@id="cboCounty"]/option', xmlAttrs)
      chosen_county <- grep("selected", counties_val)
      # Extract the Table (Election results)
      resTable <- raw %>% readHTMLTable()
      resDf <- resTable[[1]]  # return desired data frame from list of tables
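The table-extraction step can be tried offline on a small HTML string, without a browser session. This is a sketch only: the HTML below is invented and far simpler than the Missouri results page, and the column names and vote counts are hypothetical.

```r
library(XML)

# Hypothetical page source with one small results table
raw <- paste0(
  "<html><body><table>",
  "<tr><th>Candidate</th><th>Party</th><th>Votes</th></tr>",
  "<tr><td>Candidate A</td><td>DEM</td><td>1,234</td></tr>",
  "<tr><td>Candidate B</td><td>REP</td><td>987</td></tr>",
  "</table></body></html>"
)

# Parse the string as HTML and pull out every table as a data frame
doc <- htmlParse(raw, asText = TRUE)
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)
res <- tables[[1]]

# Vote counts arrive as text; drop thousands separators before converting
res$Votes <- as.numeric(gsub(",", "", res$Votes))
print(res)
```

From here the remaining work is renaming and reordering columns to match the standardized output format.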
  • 29. Conclusions & Takeaways • Great way to learn and contribute • pdftools – Good package for extracting text data from PDFs • tabulizer – Useful package for extracting tabular data from PDFs • RSelenium, XML – Great packages for web scraping pages that require simulating form input • Lots of work still needs to be done for recent elections (2000-2016) across all states • 50 states, 100s of input files in a variety of formats per state • Meaningful analysis can be done by data scientists once data is available • Presidential election results get a lot of attention, but other races are arguably as important