Data Journalism 101 workshop, presented by AP data journalist Serdar Tumgoren on April 29, 2014 to Bay Area journalists. Organized by the Society of Professional Journalists - Northern California chapter.
What is data journalism?
“Wrangling, vetting and visualizing data to bring
forth news stories in the public interest that we
never would have found otherwise.”
- Garance Burke, AP data journalist
“A data journalist is anyone ...who can fluently
work with this primary source [data]. It’s the
same as a traditional reporter, who should
know how to hunt down human sources and
- Me (I know, so lame to quote yourself)
“Data journalism is a form of reporting that
makes use of structured data (e.g.
spreadsheets, databases) as a key component
of researching and telling stories.”
- Chad Skelton, data journalist at Vancouver
Sun and journalism instructor
“Data can be the source of data journalism, or it
can be the tool with which the story is told — or
it can be both. Like any source, it should be
treated with scepticism; and like any tool, we
should be conscious of how it can shape and
restrict the stories that are created with it.”
- Paul Bradshaw, Data Journalism Handbook
Don’t try to be a Journicorn.
(Hint: They don’t exist.)
Be a journalist who uses data.
Data is just another source.
Start with a Question, then Data
● Are housing prices going up?
● Do reports of falling crime bear out across
the entire city?
● Are developers helping to finance
campaigns of politicians who approved
● Are public employee salaries on the rise?
● Public agencies (local, county, state, federal)
● Data.gov sites
● Social networking sites (often APIs)
● Nonprofits/industry experts
● Academic institutions
● Manually gathered
Not everything is on the web.
A whole world of data may never see the light of
day on gov websites. How do you find it?
● Government forms provide clues
● Gov employees
● Software contracts and manuals
● Building permits
● Campaign finance
● Corporate records
● Planning & Zoning
● Land records
● Etc. Etc.
Open Records Laws
● Know and understand your rights
● Try to negotiate first
● Seek expert advice (CalAware, CFAC)
● Don’t go fishing; craft targeted requests
● Follow through on requests
● RCFP Open Gov Guide
● RCFP Letter Generator
● FOIA Machine
● Experts: CalAware and CFAC
Understand the Data.
● What is the origin of the data?
● What do the fields mean?
● What rules surround the data?
● Seek expert advice and sanity checks.
Wrangle the Data.
● What format is the source data?
● How do I convert the data for tool of choice?
● Explore the data. Is it dirty?
● What cleanups are needed to answer my
question?
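A quick way to see whether a file is dirty is to count blanks and list the raw variants in each column. This is a minimal sketch, not a fixed recipe: the sample rows and the column names (county, permits) are invented for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a real export; column names are invented.
SAMPLE = """county,permits
Alameda,12
alameda ,7
Marin,
San Mateo,n/a
"""

def profile(csv_text):
    """Count blank cells per column and tally every raw value variant."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    blanks = Counter()
    variants = {}
    for row in rows:
        for field, value in row.items():
            value = value or ""  # short rows come back as None
            if value.strip() == "":
                blanks[field] += 1
            variants.setdefault(field, Counter())[value] += 1
    return len(rows), blanks, variants

row_count, blanks, variants = profile(SAMPLE)
print(row_count, dict(blanks))
print(sorted(variants["county"]))
```

In this sample the variant list shows "Alameda" and "alameda " as separate values, and the permits column holds a blank and an "n/a": exactly the kind of cleanup to settle before any counting.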
Sort, Filter, Sum, etc.
● Spreadsheets can take you far.
● Aggregate functions in SQL.
● Patterns and outliers in stats programs.
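Sorting, filtering and summing look like this in SQL. The sketch below uses Python's built-in sqlite3 module so it runs without a database server; the table, departments and pay figures are hypothetical.

```python
import sqlite3

# Hypothetical salary rows; departments and amounts are invented for illustration.
ROWS = [
    ("Police", "Officer", 98000),
    ("Police", "Sergeant", 121000),
    ("Library", "Librarian", 67000),
    ("Library", "Clerk", 41000),
    ("Parks", "Gardener", 52000),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (department TEXT, title TEXT, pay INTEGER)")
conn.executemany("INSERT INTO salaries VALUES (?, ?, ?)", ROWS)

# Filter (WHERE), aggregate (COUNT/SUM/AVG with GROUP BY), then sort (ORDER BY).
query = """
    SELECT department, COUNT(*) AS staff, SUM(pay) AS total_pay, AVG(pay) AS avg_pay
    FROM salaries
    WHERE pay >= 50000
    GROUP BY department
    ORDER BY total_pay DESC
"""
summary = conn.execute(query).fetchall()
for department, staff, total_pay, avg_pay in summary:
    print(department, staff, total_pay, round(avg_pay))
conn.close()
```

The same three moves (filter, group, sort) map onto a spreadsheet's filter, pivot table and sort features, so the query can double as a check on spreadsheet work.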
Add tools as needed.
Tools are abundant, free and paid.
Knowledge is abundant, freely shared*.
Most often data is a starting point or
supplement. Check conclusions in the real
world and circle back to refine and qualify data
If you’re a visual person...
...confounded by the last few bits (like me)...
Talk to people
“What data do I need to
answer my question?”
Get The Data
Clean The Data
Check The Data
Interview The Data
Interview People
Display The Data
Tell The Story
The Data Journalism Process
Story idea is the key.
Most stats were already available and
supported or confirmed by reporting. But we
wanted county breakdowns for 2013 (most
recent full year of granular data). So...
Data wrangling ain’t pretty.
We got (dirty) data for 2013.
● copy/paste -> Excel = Fail
● pdftk -> CSV -> Excel = Fail
● pdftk -> CSV -> python -> Excel = Success
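For a sense of what the Python step in a chain like this can do, here is a minimal sketch. It is not the script used for the story: the raw text, the column names and the cleanup rules (drop blank lines, repeated headers and total rows; strip thousands separators) are assumptions about typical PDF-extraction mess.

```python
import csv
import io

# Hypothetical text, standing in for what a PDF-to-CSV step might produce.
RAW = """County,Count
Alameda,"1,204"

County,Count
 Marin ,87
Total,"1,291"
"""

def clean(raw_text):
    """Drop blank lines, repeated headers and total rows; normalize names and numbers."""
    reader = csv.reader(io.StringIO(raw_text))
    header = next(reader)
    cleaned = []
    for row in reader:
        if not row or row == header:
            continue  # blank line, or a header repeated at a page break
        name = row[0].strip()
        count = row[1].replace(",", "").strip()
        if name.lower() == "total":
            continue  # keep summary rows out of the detail data
        cleaned.append([name, int(count)])
    return header, cleaned

header, rows = clean(RAW)

# Write a tidy CSV that a spreadsheet program can open directly.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(header)
writer.writerows(rows)
print(out.getvalue())
```

Skipping the total row here is a choice, not a rule: it keeps detail rows from being double counted, and the reported total can still be used later to check the cleaned figures.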
Check the data.
A few strategies to ensure accuracy:
● Manually calculate a sample of subtotals,
compare to calculated results.
● Compare totals to summary stats from third
● Have someone else check your work.
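The comparison strategies above can be partly automated. A minimal sketch, assuming the cleaned records and an independent set of reference totals are both at hand; every figure below is hypothetical.

```python
# Hypothetical figures; in practice these come from the cleaned data and
# from an independent summary such as a published total.
records = [
    {"county": "Alameda", "count": 1204},
    {"county": "Alameda", "count": 96},
    {"county": "Marin", "count": 87},
]
published_totals = {"Alameda": 1300, "Marin": 90}

def subtotals(rows):
    """Sum the count field for each county."""
    totals = {}
    for row in rows:
        totals[row["county"]] = totals.get(row["county"], 0) + row["count"]
    return totals

def mismatches(calculated, reference):
    """Return {key: (calculated, reference)} wherever the two figures differ."""
    keys = set(calculated) | set(reference)
    return {
        k: (calculated.get(k), reference.get(k))
        for k in sorted(keys)
        if calculated.get(k) != reference.get(k)
    }

problems = mismatches(subtotals(records), published_totals)
print(problems)
```

A mismatch is a question, not a verdict: the error may sit in the cleaning step, in the reference figure, or in a difference of definitions that only a phone call will settle.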
Keep a Data Diary
● Document data sources
● Document field descriptions, quirks, etc.
● Document data cleaning process
● Document analysis
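A data diary can be as simple as a text file with dated notes. The sketch below shows one possible format; the function names, the category labels and the example note are all hypothetical.

```python
from datetime import datetime, timezone

def diary_entry(category, note, when=None):
    """Format one diary line: timestamp, category, free-text note."""
    when = when or datetime.now(timezone.utc)
    return f"{when:%Y-%m-%d %H:%M} UTC | {category.upper():<8} | {note}"

def append_entry(path, category, note):
    """Append a single entry to the diary file at the given path."""
    with open(path, "a", encoding="utf-8") as diary:
        diary.write(diary_entry(category, note) + "\n")

# Hypothetical entry; categories could mirror the list above
# (source, fields, cleaning, analysis).
example = diary_entry(
    "source",
    "County report received as PDF from records office",
    when=datetime(2014, 4, 29, 9, 30, tzinfo=timezone.utc),
)
print(example)
```

Appending rather than rewriting keeps the diary in the order the work happened, which helps when someone else has to retrace the steps.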