EXTRACTING & ANALYZING
DATA FROM MUNICIPAL
FINANCIAL DISCLOSURES
Marc Joffe
OPEN DATA SCIENCE CONFERENCE
BOSTON 2015
@opendatasci
Extracting and Analyzing Data
from Municipal Financial
Disclosures
Marc Joffe
Public Sector Credit Solutions
Open Data Science Conference
Boston, May 2015
The Research Question
• How is the cost of funding public employee pensions affecting
California cities?
• I hoped to answer the question by gathering pension expenditure
data for all cities in the state.
• Main data points:
• Current and future contribution amounts
• Funded ratio
Data on City Pensions
• The best sources for information on local government pension costs
are (1) the municipality’s audited financial statements (CAFRs) and (2)
actuarial valuation reports published by the pension fund.
• In California (and some other states), most cities rely on a multi-employer
pension system. The system in California, CalPERS,
publishes one actuarial report for each local government pension
plan it administers – about 3000 in all.
• I was just interested in the roughly 1400 plans covering city
employees. CalPERS publishes a unique PDF for each plan.
• The main challenge is thus to get the 1400 PDFs and extract key data
points (such as future actuarially required contributions) from them.
Gathering the Pension Data (1 of 2)
• Found a web page that had links to all the actuarial valuation PDFs.
• In this case: http://www.calpers.ca.gov/index.jsp?bc=/about/forms-pubs/calpers-reports/actuarial-reports/home.xml
• Downloaded this page and scraped all the links
• This can be done with a Python script (ideally leveraging an HTML processing
library like BeautifulSoup) or by copying/pasting to Excel. When copying
content from a web page to Excel, it is better to use Internet Explorer than
other browsers.
• Ran a command line script to download all the links. This shell script
or Windows command file can use curl or wget to retrieve the PDFs.
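The link-scraping and download steps above can be sketched in Python. This is a minimal illustration, not the script used for the project: it relies on the standard library's html.parser instead of BeautifulSoup so that it has no dependencies, it assumes the links of interest end in ".pdf", and the names PdfLinkParser, extract_pdf_links and download_all are made up for this example.

```python
import os
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collect the href of every <a> tag that points at a PDF."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: the reports are linked with URLs ending in ".pdf".
        if href and href.lower().endswith(".pdf"):
            # Resolve relative links against the URL of the index page.
            self.links.append(urljoin(self.base_url, href))


def extract_pdf_links(html, base_url):
    """Return the PDF links found in html, without duplicates, in page order."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return list(dict.fromkeys(parser.links))


def download_all(links, out_dir):
    """Save each link into out_dir, named after the last part of its URL path."""
    os.makedirs(out_dir, exist_ok=True)
    for url in links:
        filename = os.path.basename(urlparse(url).path)
        urllib.request.urlretrieve(url, os.path.join(out_dir, filename))
```

To use it, read the saved index page into a string, pass it to extract_pdf_links together with the page's URL, and hand the result to download_all. The same list of links could instead be written to a text file and fed to curl or wget.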
Gathering the Pension Data (2 of 2)
• Because the valuation PDFs have embedded text, no OCR was
necessary. I pulled out the text with Poppler’s pdftotext command
line executable, using the -layout option to make the outputs more
readable.
• Because the PDFs had very consistent formats (they appear to have
been output by a report generator), I could take advantage of
patterns in the text. I wrote Python scripts to read each file and
extract just the portions I needed. I output the strings I captured to a
CSV file.
• I loaded the CSV file into Excel for further analysis.
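The extraction steps above might look roughly like the following sketch. The pdftotext call with -layout matches the slide; everything else is an assumption made for illustration. In particular, the two regular expressions and the field names are hypothetical, since the actual wording and layout of the valuation reports are not shown here, and real patterns would have to be written against the converted text.

```python
import csv
import re
import subprocess


def pdf_to_text(pdf_path, txt_path):
    """Convert one PDF to text with Poppler's pdftotext, preserving layout."""
    subprocess.run(["pdftotext", "-layout", pdf_path, txt_path], check=True)


# HYPOTHETICAL patterns: the labels below are placeholders, not the
# wording used in the actual reports.
PATTERNS = {
    "funded_ratio": re.compile(r"Funded Ratio\s+([\d.]+)%"),
    "employer_contribution": re.compile(r"Employer Contribution\s+\$\s*([\d,]+)"),
}


def extract_fields(text):
    """Return the first match for each pattern, or '' when nothing matches."""
    row = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        # Strip thousands separators so a spreadsheet reads the value as a number.
        row[name] = match.group(1).replace(",", "") if match else ""
    return row


def write_csv(rows, csv_path):
    """Write one row per plan; each row is a dict with 'plan' plus the fields."""
    fieldnames = ["plan"] + list(PATTERNS)
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Leaving a field blank when a pattern fails to match makes gaps easy to spot once the CSV is opened in a spreadsheet.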
Answering the “So What?” Question with Revenue Data
• The raw pension numbers are not that interesting unless placed into
some context. I wanted to calculate the ratio of pension costs to total
revenue for each city because that is a fiscal health measure. A
ranking of cities by this measure is interesting – especially to cities
near the top of the ranking!
• The actuarial valuation reports provide actuarially required
contributions for the upcoming fiscal year. I could get revenue data
from CAFRs but these are published on a delayed basis.
• A more timely source proved to be a data set provided by the State
Controller via a Socrata Open Data platform. See
http://bythenumbers.sco.ca.gov.
Mashing up the Data and Analyzing
• I now had two data sets: pension costs and revenues.
• The remaining steps needed to calculate the pension cost/revenue ratios
are as follows:
• Add up all the plans for each city to get total city pension costs.
• Map the city names in the CalPERS data set to the city names in the State Controller
data set. This was generally straightforward, but there were a couple of oddities
(such as Paso Robles = El Paso de Robles)
• Using the common key (i.e., standardized city name), combine the two data sets
• Calculate the ratio
• Sort in descending order
• I did the above in Excel and Google Sheets. I could have used Python or
another scripting language but I find spreadsheets easier.
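For readers who prefer a script to a spreadsheet, the steps above could be sketched as follows. This is only an illustration of the same logic: the alias table contains the one mapping mentioned on the slide, and the function names and input shapes (a list of per-plan costs and a dictionary of revenues) are assumptions.

```python
from collections import defaultdict

# Hand-built alias table mapping CalPERS city names to State Controller names.
# Only the example from the slide is included; a real table would be longer.
CITY_ALIASES = {"Paso Robles": "El Paso de Robles"}


def standardize(city):
    """Return the common key used to join the two data sets."""
    city = city.strip()
    return CITY_ALIASES.get(city, city)


def pension_to_revenue(plan_costs, revenues):
    """Rank cities by pension cost as a share of revenue.

    plan_costs: iterable of (city, cost) pairs, one pair per pension plan
    revenues:   dict mapping State Controller city name to total revenue
    """
    # Add up all the plans for each city.
    totals = defaultdict(float)
    for city, cost in plan_costs:
        totals[standardize(city)] += cost

    # Join on the standardized name and calculate the ratio.
    ratios = []
    for city, total in totals.items():
        revenue = revenues.get(city)
        if revenue:  # skip cities with no match or zero revenue
            ratios.append((city, total / revenue))

    # Sort in descending order.
    return sorted(ratios, key=lambda pair: pair[1], reverse=True)
```

Cities that fail to match are silently skipped here; in practice it is worth listing them, because each one usually points to another entry needed in the alias table.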
The Results….
Our next project: govwiki.us
URL: http://govwiki.us
Repo: https://github.com/govwiki/govwiki.us
Online database of all US local governments.
• Obtained a list of 91,000 local governments from
the US census
• Performed rough geocoding
• Now gathering additional data from public
sources in California
• Hope to launch in August
• Also hope to create a Wikipedia interface
• Environment: MySQL, Node.js, CoffeeScript
Original PDF Liberation Presentation – 1/2014
• In January 2014, I worked with the Sunlight Foundation to host the
“PDF Liberation Hackathon” in New York, Washington, Chicago and
San Francisco.
• A list of PDF extraction solutions and sample PDF extraction problems
is available at: http://pdfliberation.wordpress.com/
• Following are some slides related to that event
An Example of How PDF Liberation Can
Generate News
• Working with Mortgage Resolution Partners, the City of Richmond has
proposed to use its power of eminent domain to refinance mortgages
for underwater homeowners
• In July, the media reported that 624 properties had been chosen
• I wanted to know which ones, so I filed a California Public Records Act
request . . .
The Request…(Make it Very Specific)
Dear Ms. Holmes,
Pursuant to my rights under the California Public Records Act (Government Code Section 6250 et seq.), I ask to obtain a copy of the following, which I understand to be held by your agency:
Attachments A, B and C to letters sent to mortgage servicers offering to purchase mortgage loans dated on or about July 31, 2013. The form letter is available on the internet at
http://www.contracostatimes.com/west-county-times/ci_23760190/document-city-richmond-letter-mortgage-lenders?source=pkg. I understand that 32 such letters have been sent, so this request
involves as many as 96 unique documents.
The purpose of this request is to obtain a list of 624 mortgages which Richmond is offering to purchase containing the property addresses, mortgage amounts, appraised values, servicer names, and, if
possible, the name of the Residential Mortgage Backed Securities (RMBS) deal holding each mortgage. If you can provide this listing in a more concise format, I will accept it in lieu of the attachments
described in the previous paragraph.
I ask for a determination on this request within 10 days of your receipt of it, and an even prompter reply if you can make that determination without having to review the record[s] in question.
If you determine that some but not all of the information is exempt from disclosure and that you intend to withhold it, I ask that you redact it for the time being and make the rest available as
requested.
In any event, please provide a signed notification citing the legal authorities on which you rely if you determine that any or all of the information is exempt and will not be disclosed.
If I can provide any clarification that will help expedite your attention to my request, please contact me by phone at 415-578-0558 or by email at marc@publicsectorcredit.org. I ask that the requested
documents be sent to me in electronic format via return email. If you must provide paper documents, I ask that you notify me of any duplication costs exceeding $50 before you duplicate the records so
that I may decide which records I want copied. I can visit your office to collect the documents once they have been duplicated.
Thank you for your time and attention to this matter.
Sincerely,
Marc D. Joffe
1655 North California Blvd. Unit 162
Walnut Creek, CA 94596
The Response…
• Four PDFs
Processing
• Loaded the four PDFs into Able2Extract – a commercial PDF conversion tool that
costs about $100*
• Converted the PDFs to Microsoft Excel
• I now had multiple lists of properties with different fields
• I sorted the lists into the same order and then joined them together into one
master spreadsheet
• I found that three properties had mortgage balances over $800,000 and was able
to connect the balances to the addresses
• This made it possible to map the properties and to see the houses themselves on
Google Street View
* Tabula, an open source tool, is reaching the point at which it could perform the same function.
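The joining step above was done by sorting the lists into the same order in a spreadsheet. An alternative, sketched below, is to merge the lists on a normalized address key, which does not depend on every list having the same rows in the same order. The field names ("address", "balance", "appraised_value") and the function names are hypothetical, chosen only for this example.

```python
def normalize_address(address):
    """Upper-case and collapse whitespace so minor formatting differences match."""
    return " ".join(address.upper().split())


def merge_by_address(*lists):
    """Combine records from several lists into one record per address."""
    merged = {}
    for records in lists:
        for record in records:
            key = normalize_address(record["address"])
            row = merged.setdefault(key, {})
            for field, value in record.items():
                # Keep the first value seen for each field.
                row.setdefault(field, value)
    return list(merged.values())


def over_threshold(records, field, threshold):
    """Return the records whose numeric field exceeds the threshold."""
    return [r for r in records if r.get(field, 0) > threshold]
```

With the lists merged this way, a call such as over_threshold(merged, "balance", 800000) would pick out the high-balance mortgages along with their addresses.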
The Results …
• Lead story in the business section of the Chronicle
• Wall Street Journal blog post
• Finding raised at City Council meeting
• In December, Mayor Gayle McLaughlin altered the program to
exclude mortgages above the conforming loan limit ($729,500)
and to focus on blighted neighborhoods.
By the way:
The owner of the house on the right was apparently unaware
that her home had been included in the program. So my initial
theory that this had been a case of cronyism was not borne out.
Some of Our Challenges
• Government Financial Statements
• IRS Form 990s (Non-Profit Disclosures)
• House of Representatives Financial Disclosures
• Compiling a History of Torture
Government Financial Statements:
Finding the Next Detroit
IRS Form 990s: Finding members of the 1% who work at not-for-profits
. . . And finding the 1% in Congress by dissecting House Financial Disclosures
This project was taken on by our second place prize winner. Their best results came from using Captricity.com.
Documenting a History of Torture: Parsing
Amnesty International Annual Reports
This project was taken on by our first place prize winner.
Three Inter-Related Problems …
• Extracting data from PDFs that contain embedded text
• Using Optical Character Recognition (OCR) to generate text from PDFs
of scans or photographs
• Transforming unstructured text and numbers into a form that can be
readily analyzed. A related IT term is ETL (Extract-Transform-Load)
… and some Open Source Solutions
• Extracting data from PDFs that contain embedded text
PDFBox, Poppler
• Using Optical Character Recognition (OCR) to generate text from PDFs
of scans or photographs
Tesseract
• Transforming unstructured text and numbers into a form that can be
readily analyzed. A related IT term is ETL (Extract-Transform-Load)
Tabula (for table identification), OpenRefine
… or Licensed Solutions
• Extracting data from PDFs that contain embedded text
PDFLib Text Extraction Tool
• Using Optical Character Recognition (OCR) to generate text from PDFs
of scans or photographs
ABBYY (FineReader or Cloud SDK)
• Transforming unstructured text and numbers into a form that can be
readily analyzed. A related IT term is ETL (Extract-Transform-Load)
SIMX Text Converter
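Before choosing a tool, it helps to know which of the first two problems a given PDF presents. One rough way to decide, sketched below, is to run a text extractor such as pdftotext first and send the file to OCR only if almost nothing comes back. The 50-character threshold and the function names are assumptions for illustration, not a tested rule.

```python
def has_embedded_text(extracted_text, min_chars=50):
    """Heuristic: treat the PDF as text-based if extraction yields enough characters.

    extracted_text is whatever a text extractor returned for the file.
    Scanned PDFs typically yield little more than whitespace and page breaks.
    """
    visible = "".join(extracted_text.split())  # drop spaces, newlines, form feeds
    return len(visible) >= min_chars


def choose_route(extracted_text):
    """Return which kind of tool the file should be sent to."""
    return "text-extraction" if has_embedded_text(extracted_text) else "ocr"
```

A scanned document that has already been through OCR will also pass this check, so the result says only that text is present, not that it is accurate.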

Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 

Analyzing Municipal Financial Data

  • 1. EXTRACTING & ANALYZING DATA FROM MUNICIPAL FINANCIAL DISCLOSURES Marc Joffe OPEN DATA SCIENCE CONFERENCE BOSTON 2015 @opendatasci
  • 2. Extracting and Analyzing Data from Municipal Financial Disclosures Marc Joffe Public Sector Credit Solutions Open Data Science Conference Boston, May 2015
  • 3. The Research Question • How is the cost of funding public employee pensions affecting California cities? • I hoped to answer the question by gathering pension expenditure data for all cities in the state. • Main data points: • Current and future contribution amounts • Funded ratio
  • 4. Data on City Pensions • The best sources for information on local government pension costs are (1) the municipality’s audited financial statements (CAFRs) and (2) actuarial valuation reports published by the pension fund. • In California (and some other states), most cities rely on a multi-employer pension system. The system in California, CalPERS, publishes one actuarial report for each local government pension plan it administers – about 3000 in all. • I was just interested in the roughly 1400 plans covering city employees. CalPERS publishes a unique PDF for each plan. • The main challenge is thus to get the 1400 PDFs and extract key data points (such as future actuarially required contributions) from them.
  • 5. Gathering the Pension Data (1 of 2) • Found a web page that had links to all the actuarial valuation PDFs. • In this case: http://www.calpers.ca.gov/index.jsp?bc=/about/forms-pubs/calpers-reports/actuarial-reports/home.xml • Downloaded this page and scraped all the links. • This can be done with a Python script (ideally leveraging an HTML processing library like BeautifulSoup) or by copying/pasting to Excel. When copying content from a web page to Excel, it is better to use Internet Explorer than other browsers. • Ran a command line script to download all the links. This shell script or Windows command file can use curl or wget to retrieve the PDFs.
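The link-scraping step on slide 5 could be sketched roughly as follows. This is a minimal illustration, not the script used for the talk: it relies on Python's standard-library `HTMLParser` rather than BeautifulSoup so that it runs without extra installs, and the sample HTML, the `example.org` URLs, and the names `PdfLinkCollector` and `extract_pdf_links` are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkCollector(HTMLParser):
    """Collect absolute URLs of every <a href="..."> that points at a PDF."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep only links whose target ends in .pdf (case-insensitive).
        if href and href.lower().endswith(".pdf"):
            self.links.append(urljoin(self.base_url, href))


def extract_pdf_links(html, base_url):
    """Return the PDF links found in an HTML string, resolved against base_url."""
    collector = PdfLinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical sample markup standing in for the saved index page.
SAMPLE_HTML = """
<ul>
  <li><a href="/reports/plan-a.pdf">Plan A</a></li>
  <li><a href="https://example.org/reports/plan-b.PDF">Plan B</a></li>
  <li><a href="/about.html">About</a></li>
</ul>
"""

pdf_links = extract_pdf_links(SAMPLE_HTML, "https://example.org/index.html")
for url in pdf_links:
    print(url)
```

Writing the printed URLs to a text file, one per line, gives a list that a download script can loop over with curl, or that can be passed to `wget -i`.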
  • 6. Gathering the Pension Data (2 of 2) • Because the valuation PDFs have embedded text, no OCR was necessary. I pulled out the text with Poppler’s pdftotext command line executable, using the -layout option to make the outputs more readable. • Because the PDFs had very consistent formats (they appear to have been output by a report generator), I could take advantage of patterns in the text. I wrote Python scripts to read each file and extract just the portions I needed. I output the strings I captured to a CSV file. • I loaded the CSV file into Excel for further analysis.
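The pattern-based extraction on slide 6 might look something like the sketch below. It assumes the text has already been produced with a command such as `pdftotext -layout plan.pdf plan.txt`. The label wording in `PATTERNS`, the sample text, and the file name `example-plan.txt` are invented for illustration; the actual CalPERS reports would need to be inspected to find the right labels.

```python
import csv
import io
import re

# Hypothetical label patterns; the real report wording must be checked.
PATTERNS = {
    "funded_ratio": re.compile(r"Funded Ratio\s+([\d.]+)%"),
    "required_contribution": re.compile(
        r"Required Employer Contribution\s+\$\s*([\d,]+)"
    ),
}


def extract_fields(text):
    """Return one value per pattern, or an empty string when nothing matches."""
    row = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        # Strip thousands separators so the value loads as a number later.
        row[name] = match.group(1).replace(",", "") if match else ""
    return row


# Hypothetical excerpt in the style of `pdftotext -layout` output.
SAMPLE_TEXT = """
    Plan: MISCELLANEOUS PLAN OF THE CITY OF EXAMPLE
    Funded Ratio                                   74.5%
    Required Employer Contribution           $ 1,234,567
"""

row = extract_fields(SAMPLE_TEXT)

# Write the captured strings as CSV (to memory here; a file works the same way).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["plan_file", *PATTERNS])
writer.writeheader()
writer.writerow({"plan_file": "example-plan.txt", **row})
print(out.getvalue())
```

Leaving unmatched fields blank rather than raising an error makes it easy to spot, once the CSV is open in a spreadsheet, which reports deviate from the expected layout.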
  • 7. Answering the “So What Question” with Revenue Data • The raw pension numbers are not that interesting unless placed into some context. I wanted to calculate the ratio of pension costs to total revenue for each city because that is a fiscal health measure. A ranking of cities by this measure is interesting – especially to cities near the top of the ranking! • The actuarial valuation reports provide actuarially required contributions for the upcoming fiscal year. I could get revenue data from CAFRs but these are published on a delayed basis. • A more timely source proved to be a data set provided by the State Controller via a Socrata Open Data platform. See http://bythenumbers.sco.ca.gov.
  • 8. Mashing up the Data and Analyzing • I now had two data sets: pension costs and revenues. • The remaining steps needed to calculate the pension cost/revenue ratios are as follows: • Add up all the plans for each city to get total city pension costs. • Map the city names in the CalPERS data set to the city names in the State Controller data set. This was generally straightforward, but there were a couple of oddities (such as Paso Robles = El Paso de Robles) • Using the common key (i.e., standardized city name), combine the two data sets • Calculate the ratio • Sort in descending order • I did the above in Excel and Google Sheets. I could have used Python or another scripting language but I find spreadsheets easier.
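The steps on slide 8 were done in spreadsheets, but the same sequence can be sketched in Python. The dollar figures and the city "Exampleville" below are made up for illustration; only the Paso Robles / El Paso de Robles name mapping comes from the slide.

```python
from collections import defaultdict

# Hypothetical figures for illustration only.
plan_costs = [
    ("Paso Robles", 1_000_000),   # one row per pension plan
    ("Paso Robles", 500_000),
    ("Exampleville", 200_000),
]
revenues = {"El Paso de Robles": 30_000_000, "Exampleville": 10_000_000}

# Manual overrides for names that differ between the two data sets.
NAME_FIXES = {"Paso Robles": "El Paso de Robles"}


def standardize(name):
    """Map a pension-data city name onto the revenue-data spelling."""
    cleaned = name.strip()
    return NAME_FIXES.get(cleaned, cleaned)


# Step 1: add up all plans for each city.
city_costs = defaultdict(int)
for city, cost in plan_costs:
    city_costs[standardize(city)] += cost

# Steps 2-4: join on the standardized name and compute cost / revenue.
ratios = [
    (city, cost / revenues[city])
    for city, cost in city_costs.items()
    if city in revenues and revenues[city] > 0
]

# Step 5: sort in descending order of the ratio.
ratios.sort(key=lambda item: item[1], reverse=True)

for city, ratio in ratios:
    print(f"{city}: {ratio:.1%}")
```

Cities that appear in only one data set are silently dropped by this join, so in practice it is worth listing the unmatched names separately to catch further spelling differences.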
  • 10. Our next project: govwiki.us URL: http://govwiki.us Repo: https://github.com/govwiki/govwiki.us Online database of all US local governments. • Obtained a list of 91,000 local governments from the US Census • Performed rough geocoding • Now gathering additional data from public sources in California • Hope to launch in August • Also hope to create a Wikipedia interface • Environment: MySQL, Node.js, CoffeeScript
  • 11. Original PDF Liberation Presentation – 1/2014 • In January 2014, I worked with the Sunlight Foundation to host the “PDF Liberation Hackathon” in New York, Washington, Chicago and San Francisco. • A list of PDF extraction solutions and sample PDF extraction problems is available at: http://pdfliberation.wordpress.com/ • The following slides relate to that event.
  • 12. An Example of How PDF Liberation Can Generate News • Working with Mortgage Resolution Partners, the City of Richmond has proposed to use its power of eminent domain to refinance mortgages for underwater homeowners • In July, the media reported that 624 properties had been chosen • I wanted to know which ones, so I filed a California Public Records Act request . . .
  • 13. The Request… (Make it Very Specific) Dear Ms. Holmes, Pursuant to my rights under the California Public Records Act (Government Code Section 6250 et seq.), I ask to obtain a copy of the following, which I understand to be held by your agency: Attachments A, B and C to letters sent to mortgage servicers offering to purchase mortgage loans dated on or about July 31, 2013. The form letter is available on the internet at http://www.contracostatimes.com/west-county-times/ci_23760190/document-city-richmond-letter-mortgage-lenders?source=pkg. I understand that 32 such letters have been sent, so this request involves as many as 96 unique documents. The purpose of this request is to obtain a list of 624 mortgages which Richmond is offering to purchase containing the property addresses, mortgage amounts, appraised values, servicer names, and, if possible, the name of the Residential Mortgage Backed Securities (RMBS) deal holding each mortgage. If you can provide this listing in a more concise format, I will accept it in lieu of the attachments described in the previous paragraph. I ask for a determination on this request within 10 days of your receipt of it, and an even prompter reply if you can make that determination without having to review the record[s] in question. If you determine that some but not all of the information is exempt from disclosure and that you intend to withhold it, I ask that you redact it for the time being and make the rest available as requested. In any event, please provide a signed notification citing the legal authorities on which you rely if you determine that any or all of the information is exempt and will not be disclosed. If I can provide any clarification that will help expedite your attention to my request, please contact me by phone at 415-578-0558 or by email at marc@publicsectorcredit.org. I ask that the requested documents be sent to me in electronic format via return email. If you must provide paper documents, I ask that you notify me of any duplication costs exceeding $50 before you duplicate the records so that I may decide which records I want copied. I can visit your office to collect the documents once they have been duplicated. Thank you for your time and attention to this matter. Sincerely, Marc D. Joffe 1655 North California Blvd. Unit 162 Walnut Creek, CA 94596
  • 15. Processing • Loaded the four PDFs into Able2Extract – a commercial PDF conversion tool that costs about $100* • Converted the PDFs to Microsoft Excel • I now had multiple lists of properties with different fields • I sorted the lists into the same order and then joined them together into one master spreadsheet • I found that three properties had mortgage balances over $800,000 and was able to connect the balances to the addresses • This made it possible to map the properties and to see the houses themselves on Google Street View * Tabula, an open source tool, is reaching the point at which it could perform the same function.
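The joining step on slide 15 was done by sorting and aligning lists in Excel. A key-based join is a rough equivalent, sketched below under the assumption that every list carries some shared identifier. The `loan_id` field, the addresses, and the balances are hypothetical; the slides do not say which field the lists had in common.

```python
# Hypothetical rows standing in for the spreadsheets converted from each PDF.
list_a = [
    {"loan_id": "2", "address": "20 Sample Ave"},
    {"loan_id": "1", "address": "10 Sample St"},
]
list_b = [
    {"loan_id": "1", "balance": 850_000},
    {"loan_id": "2", "balance": 300_000},
]


def join_on(key, *tables):
    """Merge rows from several tables that share the same value for `key`."""
    merged = {}
    for table in tables:
        for row in table:
            merged.setdefault(row[key], {}).update(row)
    # Return rows in key order, mirroring the "sort into the same order" step.
    return [merged[k] for k in sorted(merged)]


master = join_on("loan_id", list_a, list_b)

# Addresses of properties whose mortgage balance exceeds $800,000.
large = [row["address"] for row in master if row.get("balance", 0) > 800_000]
print(large)
```

Joining on a key avoids the main risk of the sort-and-align approach, which is that a single missing or extra row shifts every later row out of step.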
  • 16. The Results … • Lead story in the business section of the Chronicle • Wall Street Journal blog post • Finding raised at City Council meeting • In December, Mayor Gayle McLaughlin altered the program to exclude mortgages above the conforming loan limit ($729,500) and to focus on blighted neighborhoods. By the way: The owner of the house on the right was apparently unaware that her home had been included in the program. So my initial theory that this had been a case of cronyism was not borne out.
  • 17. Some of Our Challenges • Government Financial Statements • IRS Form 990s (Non-Profit Disclosures) • House of Representatives Financial Disclosures • Compiling a History of Torture
  • 19. IRS Form 990s: Finding members of the 1% who work at not-for-profits
  • 20. . . . And finding the 1% in Congress by dissecting House Financial Disclosures This project was taken on by our second place prize winner. Their best results came from using Captricity.com.
  • 21. Documenting a History of Torture: Parsing Amnesty International Annual Reports This project was taken on by our first place prize winner.
  • 22. Three Inter-Related Problems … • Extracting data from PDFs that contain embedded text • Using Optical Character Recognition (OCR) to generate text from PDFs of scans or photographs • Transforming unstructured text and numbers into a form that can be readily analyzed. A related IT term is ETL (Extract-Transform-Load)
  • 23. … and some Open Source Solutions • Extracting data from PDFs that contain embedded text: PDFBox, Poppler • Using Optical Character Recognition (OCR) to generate text from PDFs of scans or photographs: Tesseract • Transforming unstructured text and numbers into a form that can be readily analyzed. A related IT term is ETL (Extract-Transform-Load): Tabula (for table identification), OpenRefine
  • 24. … or Licensed Solutions • Extracting data from PDFs that contain embedded text: PDFLib Text Extraction Tool • Using Optical Character Recognition (OCR) to generate text from PDFs of scans or photographs: ABBYY (FineReader or Cloud SDK) • Transforming unstructured text and numbers into a form that can be readily analyzed. A related IT term is ETL (Extract-Transform-Load): SIMX Text Converter