PDF Liberation Hackathon
San Francisco - January 2014

PDF Liberation Hackathon - San Francisco



Orientation slides for the PDF Liberation Hackathon in San Francisco.

Published in: Technology


  1. PDF Liberation Hackathon
     San Francisco - January 2014
  2. Thanks for Coming
     • Extracting and organizing unstructured data may be less exciting than creating visualizations, but it’s also important!
     • Civic applications include:
       • Open Government / Government Transparency
       • Data Journalism
     • This work also has commercial applications, which is why expensive enterprise software has been created to address this problem.
  3. Some of Our Challenges
     • Government Financial Statements
     • IRS Form 990s (Non-Profit Disclosures)
     • House of Representatives Financial Disclosures
     • Compiling a History of Torture
  4. Government Financial Statements: Finding the Next Detroit
  5. IRS Form 990s: Finding members of the 1% who work at not-for-profits
  6. . . . And finding the 1% in Congress by dissecting House Financial Disclosures
  7. Documenting a History of Torture: Parsing Amnesty International Annual Reports
  8. Three Inter-Related Problems …
     • Extracting data from PDFs that contain embedded text
     • Using Optical Character Recognition (OCR) to generate text from PDFs of scans or photographs
     • Transforming unstructured text and numbers into a form that can be readily analyzed. A related IT term is ETL (Extract-Transform-Load)
  9. … and some Open Source Solutions
     • Extracting data from PDFs that contain embedded text: PDFBox, Poppler
     • Using Optical Character Recognition (OCR) to generate text from PDFs of scans or photographs: Tesseract
     • Transforming unstructured text and numbers into a form that can be readily analyzed (ETL): Tabula (for table identification), OpenRefine
  10. … or Licensed Solutions
     • Extracting data from PDFs that contain embedded text: PDFLib Text Extraction Tool
     • Using Optical Character Recognition (OCR) to generate text from PDFs of scans or photographs: ABBYY (FineReader or Cloud SDK)
     • Transforming unstructured text and numbers into a form that can be readily analyzed (ETL): SIMX Text Converter
  11. My Advice
     • Choose a pre-specified challenge or pick another type of PDF that interests you
     • Establish a clear idea of what data you want to extract and how to arrange it
     • Determine which of the three operations have to be performed on the PDFs
     • Test a couple of tools with the PDFs you’re working with to see which work better, or decide to do your own scratch development
     • Put together your solution, test it and check it into GitHub
     • Don’t get discouraged: you may still have a great project even if you can only achieve partial automation. This is about reducing manual work, not necessarily eliminating it.
  12. Rules
     • Trying to keep rules at a minimum!
     • You can work at RallyPad or anywhere else
     • Unless I hear a groundswell of protest, hours at RallyPad will be:
       • Tonight until 10:30
       • Tomorrow from 8:00 to 6:00
       • Sunday from 8:00 until judging, which will start at Noon
     • To be eligible for judging:
       • Your code must be fully open and checked into GitHub by Noon Sunday
       • You can use open source or licensed components, but if you use the latter, trial limitations must not handicap your solution at judging
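
As a rough illustration of the third problem from the slides (transforming unstructured text and numbers into a form that can be readily analyzed), here is a minimal Python sketch. It assumes the text has already been pulled out of the PDF by an extractor or OCR tool. The sample text, the `parse_line` and `text_to_csv` helpers, and the "label followed by an amount" line layout are all hypothetical, invented for illustration rather than taken from any of the hackathon documents.

```python
import csv
import io
import re

# Hypothetical sample: text as it might look after extraction from a
# financial statement PDF. Real documents will be messier.
SAMPLE_TEXT = """\
Statement of Net Position
Cash and investments ........ $ 1,234,567
Accounts receivable 89,012
Total liabilities (2,500,000)
Notes to the financial statements
"""

# Assumed line layout: a label, optional dot leaders, an optional "$",
# then an amount. Parentheses around the amount mark a negative value.
LINE_RE = re.compile(
    r"^(?P<label>.+?)[\s.]*\$?\s*(?P<amount>\(?\d[\d,]*(?:\.\d+)?\)?)\s*$"
)


def parse_line(line):
    """Return (label, amount) for a line item, or None if the line has no amount."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    raw = match.group("amount")
    negative = raw.startswith("(") and raw.endswith(")")
    value = float(raw.strip("()").replace(",", ""))
    return match.group("label").strip(), -value if negative else value


def text_to_csv(text):
    """Turn extracted text into CSV, skipping lines that do not parse."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["label", "amount"])
    for line in text.splitlines():
        row = parse_line(line)
        if row is not None:
            writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    print(text_to_csv(SAMPLE_TEXT))
```

Lines that do not match, such as headings, are skipped rather than guessed at, which fits the advice above: partial automation that reduces manual work is still useful, and the skipped lines can be reviewed by hand.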
