KEEPING
GOVERNMENTS
ACCOUNTABLE
WITH OPEN DATA
Cezary Podkul
OPEN DATA SCIENCE CONFERENCE
BOSTON 2015
@opendatasci
KEEPING GOVERNMENTS
ACCOUNTABLE WITH
OPEN DATA SCIENCE
Cezary Podkul, ProPublica | @Cezary
5/31/2015 ODSC 2015 | Boston 2
Quick Word About ProPublica
• We are a non-profit investigative newsroom focused on accountability journalism
• We publish stories, develop news apps and tools, and open-source a lot of our code at: github.com/propublica
Accountability Journalism
• There is a growing need for it in general, and in public finance in particular:
- Detroit Free Press, April 5, 1993
- Chicago Tribune, Nov. 1, 2013
- The Bond Buyer, Feb. 12, 2014
- ProPublica, Aug. 7, 2014
- Boston Herald, June 10, 2012
- BenefitsPro, Feb. 12, 2015
- USA Today, Dec. 3, 2013
- Wall Street Journal, Jan. 26, 2010
- Voice of San Diego, Aug. 6, 2012
The Good News
• A lot of data already exists on the
finances of state and local
governments:
– Governments that borrow money from
investors provide bond offering documents
and other disclosures on EMMA
– They must also produce annual filings
called “Comprehensive Annual Financial
Reports” which detail all of their financials
The Good News: EMMA
• What is EMMA?
– Electronic Municipal Market Access
• Since 2009, the official repository
for muni bond offering documents
and continuing disclosures
• Run by the Municipal Securities
Rulemaking Board (MSRB)
3/7/2015 NICAR 2015 | Atlanta 6
• What’s in EMMA?
– Data on more than 1.2 million muni bonds:
• Official statements; ongoing financial
disclosures; advance refunding documents;
event notices, voluntary disclosures, and more
– Real-time trade data for nearly every
municipal bond bought and sold
– Political contribution disclosures
– Documents, documents, more documents
The Bad News
• EMMA is a great repository of info,
but little of it is easily accessible:
– PDFs, PDFs and more PDFs
• Sell a bond? Submit a PDF
• Material event happened? Tell us via PDF
• File financials? File a PDF
– No standardized reporting templates
• Important info scattered in different places
– No machine-readable bulk download
• XBRL? You wish
Things Could Be Better
• The SEC’s EDGAR database makes a wealth of
info available about corporations:
– Bulk download of filings available via FTP:
• http://datahub.io/dataset/edgar
• ftp://ftp.sec.gov/
– The agency is also moving away from text-based
submissions to XBRL filings:
• http://www.sec.gov/info/edgar/edgartaxonomies.shtml
– No PDFs … seriously:
• “Only documents submitted to the EDGAR system in
either plain text or HTML are official filings. PDF
documents are unofficial copies of filings. Filers may not
use the unofficial PDF copies instead of plain text or
HTML documents to meet filing requirements.”
The Result
• When IBM files its annual form 10-K, you get this:
– XBRL:
• http://www.sec.gov/Archives/edgar/data/51143/000104746915001106/ibm-20141231_pre.xml
– Text:
• http://www.sec.gov/Archives/edgar/data/51143/0001047469-15-001106.txt
– Even an interactive data explorer, with Excel download:
The Result
• When Detroit files its Comprehensive Annual
Financial Report with EMMA, you get this:
– http://emma.msrb.org/ER789294-ER614016-ER1015978.pdf
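To illustrate why a structured filing is so much easier to work with than a PDF, here is a minimal sketch of pulling tagged values out of an XBRL-style XML fragment with Python's standard library. The fragment, its namespace URI, the context name and the numbers are all hypothetical and heavily simplified; real XBRL instance documents carry contexts, units and many more elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment in the style of an XBRL instance.
# The namespace URI and the values are made up for illustration.
SAMPLE = """<xbrl xmlns:us-gaap="http://example.com/us-gaap">
  <us-gaap:Revenues contextRef="FY2014" unitRef="USD">1000</us-gaap:Revenues>
  <us-gaap:NetIncomeLoss contextRef="FY2014" unitRef="USD">250</us-gaap:NetIncomeLoss>
</xbrl>"""


def extract_facts(xml_text):
    """Return {(concept, context): value} for every tagged fact."""
    root = ET.fromstring(xml_text)
    facts = {}
    for elem in root:
        # ElementTree renders qualified tags as "{namespace}LocalName".
        concept = elem.tag.split("}", 1)[-1]
        facts[(concept, elem.get("contextRef"))] = int(elem.text)
    return facts


facts = extract_facts(SAMPLE)
for (concept, context), value in facts.items():
    print(concept, context, value)
```

Nothing comparable is possible with a scanned PDF: the same numbers have to be located and typed in by hand.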
Happy Hunting
• So how do you spot anomalies like these and
write about them in a systematic way?
[Chart: Ohio Series 2007B Tobacco Settlement Bonds. Amount owed over time (principal and accreted interest), 10/2007 through 10/2046; y-axis from $0 to $3,500,000,000]
$191.3m borrowed, with $3.2bn due at maturity in 2047. Interest accrues at a 7.25% rate, compounded. No option to redeem until 2017.
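The Ohio Series 2007B figures can be roughly checked with compound-interest arithmetic. The sketch below is an approximation under stated assumptions: the slide says only that interest is "compounded", so semiannual compounding and a rounded 40-year term (2007 to 2047) are assumptions, and the function name is made up for illustration.

```python
def accreted_value(principal, annual_rate, years, periods_per_year=2):
    """Amount owed after interest compounds with no payments along the way."""
    periods = years * periods_per_year
    return principal * (1 + annual_rate / periods_per_year) ** periods


principal = 191.3e6  # amount borrowed, from the slide
rate = 0.0725        # accretion rate, from the slide
years = 40           # assumption: 2007 issue to 2047 maturity, rounded

owed = accreted_value(principal, rate, years)
print(f"Owed at maturity: ${owed / 1e9:.2f}bn")
print(f"Repayment multiple: {owed / principal:.1f}x")
```

Under these assumptions the result lands in the neighborhood of the $3.2bn on the chart; the exact figure depends on the actual issue and maturity dates and the compounding convention in the bond documents.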
Example: Tobacco Bonds
• That’s what I wanted to do for my series on tobacco
bonds – state and local debts backed by payments
from the 1998 legal settlement with Big Tobacco
Example: Tobacco Bonds
• Problem: How do you define the sample universe?
– How many bonds are there, which ones are the anomalies?
– Searching on EMMA wasn’t much help; just links to PDFs
• Solution: Asked a data vendor, Thomson
Reuters SDC, for their list:
Source: Thomson Reuters SDC
Example: Tobacco Bonds
• Problem: How do you vet the data?
– Need to ensure completeness and accuracy
• Solution: Lots, and lots of reading
– Re-created Thomson Reuters database from paper filings, zeroing in on 38 deals that included the anomalous bonds
– Logged all the terms and conditions we needed to calculate the amounts owed on the debt
Example: Tobacco Bonds
• Why not do it programmatically?
Wish we could have, but:
– Data often buried in
scanned PDFs like this ->
– Even if you OCR, data do
not appear in same place
across documents
– Different labels, different
conventions for reporting
– Sometimes, repayment
amounts not reported at all
Example: Tobacco Bonds
• Results:
– Calculated that, in aggregate, state and local governments promised to repay $64 billion on $3 billion they raised by borrowing using these bonds
– Money from tobacco settlement was supposed to go for health care, instead turned into multi-generational debt
– The bonds are now heading for default, prompting some state and local governments to bail out bondholders
– Focused attention on this issue, spurred additional local, state and national media coverage
Source: GoComics
Next Steps
• The Financial Transparency Act of 2015
has some helpful provisions in it:
• But for now it’s up to us to liberate the data
Source: Data Transparency Coalition
Example: Treasury.io
• API for daily spending, revenue and
debt operations data for U.S. Treasury
Developed by csv soundsystem with a Knight-Mozilla Open News Code Sprint grant
Example: Treasury.io
• Turns text:
• Into structured csv:
• Parser code available at:
https://github.com/csvsoundsystem/federal-treasury-api
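The text-to-csv step can be sketched roughly as follows. This is not the parser from the repository above; it is a minimal illustration that assumes a fixed-width table whose columns are separated by two or more spaces. The sample text, its labels and its numbers are all hypothetical.

```python
import csv
import io
import re

# Hypothetical sample in the style of a fixed-width statement table.
SAMPLE = """\
Deposits                          Today      This month
Individual Income Taxes        $  1,234   $     45,678
Corporation Income Taxes            567          8,901
"""


def parse_rows(text):
    """Split each data line into a label followed by integer columns."""
    rows = []
    for line in text.splitlines()[1:]:  # skip the header line
        # Columns are separated by 2+ spaces, optionally followed by "$".
        parts = re.split(r"\s{2,}\$?\s*", line.strip())
        label, values = parts[0], parts[1:]
        rows.append([label] + [int(v.replace(",", "")) for v in values])
    return rows


def to_csv(rows, header):
    """Serialize parsed rows as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


rows = parse_rows(SAMPLE)
print(to_csv(rows, ["item", "today", "month_to_date"]))
```

Real statements are messier (footnotes, wrapped labels, layouts that change over time), which is why a maintained parser is worth sharing.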
Next Challenge
• The U.S. Treasury publishes even more
useful data in its monthly statement:
– http://www.fiscal.treasury.gov/fsreports/rpt/mthTreasStmt/backissues.htm
• I am looking for developers interested
in helping liberate the data
– Is that you? Code repo available here:
https://github.com/csvsoundsystem/monthly-treasury-statements
Questions?
cezary.podkul@propublica.org
@Cezary
