Turning Documents into Data
@dataeditor
#ONA17
bit.ly/docs2data
Documents Are Data
We often think of data as spreadsheets, but much of it is unstructured
Not every set of documents needs to be structured to be useful
Let’s add some structure
Adding structure often means putting data in a spreadsheet
Sometimes you can extract the data because it’s already structured
Sometimes the data is sort of structured but needs some help
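As a rough illustration of "sort of structured" data, here is a minimal Python sketch that pulls labeled fields out of extracted text and writes them as CSV. The sample text, the field labels (Name, Date, Score) and the function names are all hypothetical; real documents will need their own pattern.

```python
import csv
import io
import re

# Hypothetical sample: text as it might come out of a PDF or an OCR pass.
RAW_TEXT = """\
Inspection Report
Name: Main Street Cafe   Date: 03/14/2017   Score: 92
Name: Harbor Grill   Date: 03/15/2017   Score: 78
Page 1 of 1
"""

# One record per line, with labeled fields separated by runs of whitespace.
RECORD = re.compile(
    r"Name:\s*(?P<name>.+?)\s+"
    r"Date:\s*(?P<date>\d{2}/\d{2}/\d{4})\s+"
    r"Score:\s*(?P<score>\d+)"
)


def extract_rows(text):
    """Return one dict per line that matches the record pattern."""
    rows = []
    for line in text.splitlines():
        match = RECORD.search(line)
        if match:  # headers, footers and page numbers simply do not match
            rows.append(match.groupdict())
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text that a spreadsheet program can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["name", "date", "score"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(extract_rows(RAW_TEXT)))
```

Lines that do not match are skipped silently, so it is worth comparing the row count against the source document before trusting the result.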
Sometimes you just have to do things the old-fashioned way
And sometimes you have to get super creative
Overarching thoughts
● “Documents and data don’t lie” is true
○ Most of the time
● Government agencies keep documents in a data format
○ Some of the time
● Asking for the data is always the easiest route
○ If that fails, scraping and conversion are the next step
● Assembling your own data from documents sets you apart
● Make your data available when you’re done
○ Don’t force others to reinvent the wheel
Tools of the trade
● Microsoft Excel or other spreadsheet programs
● Tabula
● Adobe Acrobat
● CometDocs
○ If you’re an IRE member, you can get a free premium account
● Python or R for scraping
● Command-line tools
○ Tesseract
○ ImageMagick
● DocumentCloud
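One way the command-line tools above can be chained from Python, as a sketch rather than a recommended pipeline: ImageMagick rasterizes the first page of a scanned PDF and Tesseract runs OCR on the image. It assumes both tools are installed and on the PATH, that ImageMagick can read PDFs (which usually requires Ghostscript), and that the ImageMagick command is named `convert` (newer releases use `magick`). The function names and file paths are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(pdf_path, work_dir, dpi=300):
    """Build the ImageMagick and Tesseract command lines for page one of a PDF."""
    pdf = Path(pdf_path)
    out = Path(work_dir)
    image = out / f"{pdf.stem}.png"
    text_base = out / pdf.stem  # Tesseract appends .txt to this base name

    # "[0]" selects the first page; -density sets the rasterization DPI.
    convert_cmd = [
        "convert", "-density", str(dpi), f"{pdf}[0]",
        "-colorspace", "Gray", str(image),
    ]
    # "-l eng" selects the English language data.
    ocr_cmd = ["tesseract", str(image), str(text_base), "-l", "eng"]
    return convert_cmd, ocr_cmd


def ocr_first_page(pdf_path, work_dir):
    """Run both commands and return the recognized text, or None if a tool is missing."""
    convert_cmd, ocr_cmd = build_commands(pdf_path, work_dir)
    if shutil.which(convert_cmd[0]) is None or shutil.which(ocr_cmd[0]) is None:
        return None
    Path(work_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(convert_cmd, check=True)
    subprocess.run(ocr_cmd, check=True)
    return Path(ocr_cmd[2] + ".txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical input file; replace with a real scanned PDF.
    print(ocr_first_page("scans/report.pdf", "ocr_output"))
```

OCR output is rarely clean, so the text usually still needs parsing and spot-checking against the original pages before it goes into a spreadsheet.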

Turning Documents into Data — Steven Rich