This document summarizes Marc Joffe's presentation on extracting and analyzing data from municipal financial disclosures. It discusses gathering pension data from over 1,400 PDF reports published by CalPERS on city pension plans in California. It describes downloading the PDFs, extracting text data using Python scripts, and loading the extracted data into spreadsheets. It also discusses combining the pension data with revenue data from the State Controller to calculate ratios of pension costs to total revenue for each city.
Analyzing Municipal Financial Data
1. EXTRACTING & ANALYZING
DATA FROM MUNICIPAL
FINANCIAL DISCLOSURES
Marc Joffe
OPEN DATA SCIENCE CONFERENCE
BOSTON 2015
@opendatasci
2. Extracting and Analyzing Data
from Municipal Financial
Disclosures
Marc Joffe
Public Sector Credit Solutions
Open Data Science Conference
Boston, May 2015
3. The Research Question
• How is the cost of funding public employee pensions affecting
California cities?
• I hoped to answer the question by gathering pension expenditure
data for all cities in the state.
• Main data points:
• Current and future contribution amounts
• Funded ratio
4. Data on City Pensions
• The best sources for information on local government pension costs
are (1) the municipality’s audited financial statements (CAFRs) and (2)
actuarial valuation reports published by the pension fund.
• In California (and some other states), most cities rely on a multi-
employer pension system. The system in California, CalPERS,
publishes one actuarial report for each local government pension
plan it administers – about 3,000 in all.
• I was just interested in the roughly 1,400 plans covering city
employees. CalPERS publishes a unique PDF for each plan.
• The main challenge is thus to get the 1,400 PDFs and extract key data
points (such as future actuarially required contributions) from them.
5. Gathering the Pension Data (1 of 2)
• Found a web page that had links to all the actuarial valuation PDFs.
• In this case: http://www.calpers.ca.gov/index.jsp?bc=/about/forms-pubs/calpers-reports/actuarial-reports/home.xml
• Downloaded this page and scraped all the links
• This can be done with a python script (ideally leveraging an HTML processing
library like BeautifulSoup) or by copying/pasting to Excel. When copying
content from a web page to Excel, it is better to use Internet Explorer than
other browsers.
• Ran a command line script to download all the links. This shell script
or Windows command file can use curl or wget to retrieve the PDFs.
6. Gathering the Pension Data (2 of 2)
• Because the valuation PDFs have embedded text, no OCR was
necessary. I pulled out the text with Poppler’s pdftotext command
line executable, using the -layout option to make the outputs more
readable.
• Because the PDFs had very consistent formats (they appear to have
been output by a report generator), I could take advantage of
patterns in the text. I wrote Python scripts to read each file and
extract just the portions I needed. I output the strings I captured to a
CSV file.
• I loaded the CSV file into Excel for further analysis.
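The extract-and-export step can be sketched like this. The sample text, field labels, and regexes below are hypothetical; the real CalPERS reports use their own wording, but the idea is the same: a consistent report layout lets simple patterns pull out each data point:

```python
import csv
import io
import re

# Hypothetical fragment of pdftotext -layout output for one plan.
SAMPLE_TEXT = """
MISCELLANEOUS PLAN OF THE CITY OF EXAMPLEVILLE
Required Employer Contribution          $1,234,567
Funded Ratio                            75.4%
"""

def parse_report(text):
    """Pull the key data points out of one report's extracted text."""
    plan = re.search(r'PLAN OF THE (CITY OF [A-Z ]+)', text)
    contrib = re.search(r'Required Employer Contribution\s+\$([\d,]+)', text)
    funded = re.search(r'Funded Ratio\s+([\d.]+)%', text)
    return {
        "city": plan.group(1).title(),
        "contribution": int(contrib.group(1).replace(",", "")),
        "funded_ratio": float(funded.group(1)),
    }

row = parse_report(SAMPLE_TEXT)

# Write the captured strings to CSV for loading into Excel;
# in practice this loop runs once per extracted file.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["city", "contribution", "funded_ratio"])
writer.writeheader()
writer.writerow(row)
print(buf.getvalue())
```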
7. Answering the “So What?” Question with Revenue Data
• The raw pension numbers are not that interesting unless placed into
some context. I wanted to calculate the ratio of pension costs to total
revenue for each city because that is a fiscal health measure. A
ranking of cities by this measure is interesting – especially to cities
near the top of the ranking!
• The actuarial valuation reports provide actuarially required
contributions for the upcoming fiscal year. I could get revenue data
from CAFRs but these are published on a delayed basis.
• A more timely source proved to be a data set provided by the State
Controller via a Socrata Open Data platform. See
http://bythenumbers.sco.ca.gov.
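Socrata platforms expose data sets as JSON over HTTP, so a pull can be scripted. The field names (`entity_name`, `total_revenues`) and the `<dataset-id>` placeholder below are assumptions for illustration, not the real schema; here a canned response is parsed instead of making a live request:

```python
import json

# A live request would look roughly like:
#   urllib.request.urlopen(
#       "http://bythenumbers.sco.ca.gov/resource/<dataset-id>.json")
# Hypothetical response body standing in for the real data set:
SAMPLE_RESPONSE = '[{"entity_name": "Exampleville", "total_revenues": "50000000"}]'

records = json.loads(SAMPLE_RESPONSE)

# Build a city -> total revenue lookup for the ratio calculation.
revenues = {r["entity_name"]: int(r["total_revenues"]) for r in records}
print(revenues)
```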
8. Mashing up the Data and Analyzing
• I now had two data sets: pension costs and revenues.
• The remaining steps needed to calculate the pension cost/revenue ratios
are as follows:
• Add up all the plans for each city to get total city pension costs.
• Map the city names in the CalPERS data set to the city names in the State Controller
data set. This was generally straightforward, but there were a couple of oddities
(such as Paso Robles = El Paso de Robles)
• Using the common key (i.e., standardized city name), combine the two data sets
• Calculate the ratio
• Sort in descending order
• I did the above in Excel and Google Sheets. I could have used Python or
another scripting language but I find spreadsheets easier.
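The spreadsheet steps above map directly onto a few lines of script. The figures and city names here are made up for illustration (only the Paso Robles / El Paso de Robles oddity comes from the deck):

```python
# Hypothetical inputs: one row per plan (cities can have several plans),
# and revenues keyed by the State Controller's city names.
pension_costs = [
    ("Exampleville", 2_000_000),
    ("Exampleville", 500_000),
    ("El Paso de Robles", 1_200_000),
]
revenues = {
    "Exampleville": 50_000_000,
    "Paso Robles": 30_000_000,
}
# Map CalPERS names onto the State Controller's names where they differ.
NAME_FIXES = {"El Paso de Robles": "Paso Robles"}

# Step 1: add up all the plans for each city.
totals = {}
for city, cost in pension_costs:
    city = NAME_FIXES.get(city, city)
    totals[city] = totals.get(city, 0) + cost

# Steps 2-5: join on the standardized name, compute the ratio,
# and sort in descending order.
ratios = sorted(
    ((city, cost / revenues[city]) for city, cost in totals.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for city, ratio in ratios:
    print(f"{city}: {ratio:.1%}")
```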
10. Our next project: govwiki.us
URL: http://govwiki.us
Repo: https://github.com/govwiki/govwiki.us
Online database of all US local governments.
• Obtained a list of 91,000 local governments from
the US census
• Performed rough geocoding
• Now gathering additional data from public
sources in California
• Hope to launch in August
• Also hope to create a Wikipedia interface
• Environment: MySQL, Node.js, CoffeeScript
11. Original PDF Liberation Presentation – 1/2014
• In January 2014, I worked with the Sunlight Foundation to host the
“PDF Liberation Hackathon” in New York, Washington, Chicago and
San Francisco.
• A list of PDF extraction solutions and sample PDF extraction problems
available at: http://pdfliberation.wordpress.com/
• Following are some slides related to that event
12. An Example of How PDF Liberation Can
Generate News
• Working with Mortgage Resolution Partners, the City of Richmond has
proposed to use its power of eminent domain to refinance mortgages
for underwater homeowners
• In July, the media reported that 624 properties had been chosen
• I wanted to know which ones, so I filed a California Public Records Act
request . . .
13. The Request…(Make it Very Specific)
Dear Ms. Holmes,
Pursuant to my rights under the California Public Records Act (Government Code Section 6250 et seq.), I ask to obtain a copy of the following, which I understand to be held by your agency:
Attachments A, B and C to letters sent to mortgage servicers offering to purchase mortgage loans dated on or about July 31, 2013. The form letter is available on the internet at
http://www.contracostatimes.com/west-county-times/ci_23760190/document-city-richmond-letter-mortgage-lenders?source=pkg. I understand that 32 such letters have been sent, so this request
involves as many as 96 unique documents.
The purpose of this request is to obtain a list of 624 mortgages which Richmond is offering to purchase containing the property addresses, mortgage amounts, appraised values, servicer names, and, if
possible, the name of the Residential Mortgage Backed Securities (RMBS) deal holding each mortgage. If you can provide this listing in a more concise format, I will accept it in lieu of the attachments
described in the previous paragraph.
I ask for a determination on this request within 10 days of your receipt of it, and an even prompter reply if you can make that determination without having to review the record[s] in question.
If you determine that some but not all of the information is exempt from disclosure and that you intend to withhold it, I ask that you redact it for the time being and make the rest available as
requested.
In any event, please provide a signed notification citing the legal authorities on which you rely if you determine that any or all of the information is exempt and will not be disclosed.
If I can provide any clarification that will help expedite your attention to my request, please contact me by phone at 415-578-0558 or by email at marc@publicsectorcredit.org. I ask that the requested
documents be sent to me in electronic format via return email. If you must provide paper documents, I ask that you notify me of any duplication costs exceeding $50 before you duplicate the records so
that I may decide which records I want copied. I can visit your office to collect the documents once they have been duplicated.
Thank you for your time and attention to this matter.
Sincerely,
Marc D. Joffe
1655 North California Blvd. Unit 162
Walnut Creek, CA 94596
15. Processing
• Loaded the four PDFs into Able2Extract – a commercial PDF conversion tool that
costs about $100*
• Converted the PDFs to Microsoft Excel
• I now had multiple lists of properties with different fields
• I sorted the lists into the same order and then joined them together into one
master spreadsheet
• I found that three properties had mortgage balances over $800,000 and was able
to connect the balances to the addresses
• This made it possible to map the properties and to see the houses themselves on
Google Street View
* Tabula, an open source tool, is reaching the point at which it could perform the same function.
16. The Results …
• Lead story in the business section of the Chronicle
• Wall Street Journal blog post
• Finding raised at City Council meeting
• In December, Mayor Gayle McLaughlin altered the program to
exclude mortgages above the conforming loan limit ($729,500)
and to focus on blighted neighborhoods.
By the way:
The owner of the house on the right was apparently unaware
that her home had been included in the program. So my initial
theory that this had been a case of cronyism was not borne out.
17. Some of Our Challenges
• Government Financial Statements
• IRS Form 990s (Non-Profit Disclosures)
• House of Representatives Financial Disclosures
• Compiling a History of Torture
20. . . . And finding the 1% in Congress by dissecting House Financial Disclosures
This project was taken on by our second place prize winner. Their best results came from using Captricity.com.
21. Documenting a History of Torture: Parsing
Amnesty International Annual Reports
This project was taken on by our first place prize winner.
22. Three Inter-Related Problems …
• Extracting data from PDFs that contain embedded text
• Using Optical Character Recognition (OCR) to generate text from PDFs
of scans or photographs
• Transforming unstructured text and numbers into a form that can be
readily analyzed. A related IT term is ETL (Extract-Transform-Load)
23. … and some Open Source Solutions
• Extracting data from PDFs that contain embedded text
PDFBox, Poppler
• Using Optical Character Recognition (OCR) to generate text from PDFs
of scans or photographs
Tesseract
• Transforming unstructured text and numbers into a form that can be
readily analyzed. A related IT term is ETL (Extract-Transform-Load)
Tabula (for table identification), OpenRefine
24. … or Licensed Solutions
• Extracting data from PDFs that contain embedded text
PDFLib Text Extraction Tool
• Using Optical Character Recognition (OCR) to generate text from PDFs
of scans or photographs
ABBYY (FineReader or Cloud SDK)
• Transforming unstructured text and numbers into a form that can be
readily analyzed. A related IT term is ETL (Extract-Transform-Load)
SIMX Text Converter