Introduction
Paul Bradshaw
Data journalism
Ivy Lee
“Each weekday, my computer
program goes to the Chicago
Police Department's website and
gathers all crimes reported in
Chicago.”
Adrian Holovaty
Great stories
Engagement
Targeting/relevance
Why?
“The Tribune’s biggest magnet
by far has been its more than
three dozen interactive
databases, which collectively
have drawn three times as many
page views as the site’s stories.”
http://bit.ly/dj2dmz
Times film genres
Data Journalism Continuum
1. Finding data
What is data?
Numbers
Text
Connections
Live data
Behavioural data
Images, audio, video
Anything that a computer can
work with
Start with the data and look for the
stories? (MPs’ expenses)
Or start with a lead and look for the data?
Passive vs active
data journalism
Data.gov.uk
What Do They Know
Openlylocal, Scraperwiki
Disclosure logs
RSS feeds, XML, structured data
Some UK projects
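A rough sketch of why structured feeds are convenient: an RSS feed is XML, so a few lines of standard-library Python can pull out every item. The feed below is made up for illustration (titles and example.org links are assumptions); a real one would be downloaded from a disclosure log or data portal.

```python
import xml.etree.ElementTree as ET

# Made-up RSS feed standing in for a real disclosure-log feed
SAMPLE_RSS = """
<rss version="2.0">
  <channel>
    <title>Disclosure log</title>
    <item><title>FOI response 101</title><link>http://example.org/foi/101</link></item>
    <item><title>FOI response 102</title><link>http://example.org/foi/102</link></item>
  </channel>
</rss>
"""

root = ET.fromstring(SAMPLE_RSS)

# Each <item> becomes a (title, link) pair
items = [(item.findtext("title"), item.findtext("link")) for item in root.iter("item")]

for title, link in items:
    print(title, link)
```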
Delicious.com/paulb/car
CAR
Advanced search by file type
“Performance figures” filetype:pdf
filetype:xls
filetype:doc
filetype:ppt
filetype:rdf OR filetype:xml
Advanced search by domain
“Disclosure logs” site:.gov.es
Database site:.org.cat OR site:.org
+Tables -chairs site:
Health, police, military domains
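A minimal sketch of how file-type and domain searches could be assembled in code when many variations need to be run. The helper `build_query` is hypothetical; it assumes the search engine accepts `filetype:` and `site:` with no space after the colon, and a leading hyphen to exclude a word.

```python
def build_query(phrase=None, filetypes=(), sites=(), exclude=()):
    """Assemble a search query string from operator parts (hypothetical helper)."""
    parts = []
    if phrase:
        parts.append(f'"{phrase}"')  # exact-phrase match
    if filetypes:
        parts.append(" OR ".join(f"filetype:{ft}" for ft in filetypes))
    if sites:
        parts.append(" OR ".join(f"site:{s}" for s in sites))
    parts.extend(f"-{word}" for word in exclude)  # words to leave out
    return " ".join(parts)


print(build_query("Performance figures", filetypes=["pdf"]))
print(build_query(filetypes=["rdf", "xml"]))
print(build_query("Disclosure logs", sites=[".gov.es"]))
print(build_query("Tables", exclude=["chairs"]))
```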
Use overseas sources
• US medicine databases
• EU subsidy databases
• Swedish people data
• International police agency
correspondence
Scraping
Scraping can automate & schedule the
gathering process if there are multiple
sources
Tools: OutWit Hub plugin, Yahoo! Pipes,
Scraperwiki, Google Spreadsheets
formulae
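For those who prefer code to the tools above, a scraper can be sketched with Python's standard library alone. This is an illustration under assumptions, not how any of the named tools work: the class name `TableScraper` and the sample page are made up, and a real scraper would fetch the page over the network and run on a schedule (for example via cron).

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row being built, None outside <tr>
        self._cell = None  # text pieces of the current cell, None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample page; in practice the HTML would come from
# urllib.request.urlopen(url).read().decode()
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>Offence</th></tr>
  <tr><td>2010-05-01</td><td>Burglary</td></tr>
  <tr><td>2010-05-02</td><td>Theft</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(SAMPLE_HTML)
header, *records = scraper.rows
print(header)
print(records)
```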
Interrogating data
Humans collect data
Humans enter data
Human error
Time spent now...
Different words for the same thing
Double spaces, punctuation
Wrong data type
Mistyped
Duplicate entries
Default entries (1/1/00)
...Saves time later
"Because we take the time to clean the
data, we are able to do lobbying stories
no other news organisation can do."
David Donald,
Center for Public Integrity
Group by term then sort to see
duplications
Find & replace double spaces, etc.
Select column/row & check data type
Sort to find unusually large/small, and
neighbouring misspellings
Cleaning methods
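The same cleaning steps can be sketched in code. The rows below are invented to show the problems listed earlier (double spaces, punctuation, wrong data type, default dates); the function name `normalise` and the field names are assumptions.

```python
import re
from collections import Counter

# Made-up rows showing typical problems
rows = [
    {"name": "Acme  Ltd.", "amount": "1200", "date": "2010-03-04"},
    {"name": "acme ltd", "amount": "1200", "date": "2010-03-04"},
    {"name": "Bolt & Co", "amount": "n/a", "date": "1/1/00"},
]


def normalise(name):
    """Collapse repeated whitespace, drop punctuation, lower-case."""
    name = re.sub(r"\s+", " ", name).strip()
    name = re.sub(r"[^\w\s&]", "", name)
    return name.lower()


# Group by normalised term, then sort to see duplications
counts = Counter(normalise(r["name"]) for r in rows)
for term, n in sorted(counts.items()):
    print(term, n)

# Check data type: flag amounts that are not whole numbers
bad_amounts = [r for r in rows if not r["amount"].isdigit()]

# Flag default entries such as 1/1/00
default_dates = [r for r in rows if r["date"] == "1/1/00"]

print(bad_amounts)
print(default_dates)
```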
Never publish a name from data without
running a background check
Check.
Other tools
Freebase Gridworks:
see http://vimeo.com/10081183
Visualising data
or http://chartchooser.juiceanalytics.com/
(trends, dips, correlations)
(comparison, themes)
(proportions, comparison)
Mashing data
Geocoded data with map
- Live data (e.g. Twitter API)
- Static data (e.g. Google Docs)
- Dynamic data (e.g. Google Form)
2 spreadsheets with common data
- Tools: MySQL, Access, etc.
Combining data sources
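A minimal sketch of joining two spreadsheets on a common column. SQLite (bundled with Python) is swapped in here for MySQL or Access so the example runs without a server; the table names, column names and figures are all made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two "spreadsheets" that share a school_id column (invented data)
conn.executescript("""
    CREATE TABLE schools (school_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE results (school_id INTEGER, pass_rate REAL);
    INSERT INTO schools VALUES (1, 'North High'), (2, 'South High');
    INSERT INTO results VALUES (1, 0.82), (2, 0.67);
""")

# Combine them on the common column, highest pass rate first
joined = conn.execute("""
    SELECT s.name, r.pass_rate
    FROM schools AS s
    JOIN results AS r ON r.school_id = s.school_id
    ORDER BY r.pass_rate DESC
""").fetchall()

print(joined)
conn.close()
```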
Twittermap
Wikipedia map
NYT Property
Guardian vs Nature
BBC Most Read
BBC Olympic Village
Combining data sources
Big events (protests, Olympics,
inauguration)
Comparisons
Geocoded data
Connections
What mashes well?
Aggregates
Maps
Filters
Counts
Cleans or reformats (regex)
Yahoo! Pipes
Scraperwiki – mapping library
Maptube – combine maps
Google Docs – publish in different
formats
Other tools
Computer-readable data
Paris – France, Texas, or Hilton?
Unique identifiers – usually URI
RDF, RDFa, XML, etc.
Semantic web & linked data
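A small illustration of the Paris problem: grouped by label, three different things collapse into one; grouped by unique identifier, they stay apart. The example.org URIs are hypothetical placeholders, not identifiers from any real dataset.

```python
# Four records that all carry the label "Paris" (URIs are hypothetical)
records = [
    {"label": "Paris", "uri": "http://example.org/place/paris-france"},
    {"label": "Paris", "uri": "http://example.org/place/paris-texas"},
    {"label": "Paris", "uri": "http://example.org/person/paris-hilton"},
    {"label": "Paris", "uri": "http://example.org/place/paris-france"},
]

by_label = {r["label"] for r in records}  # ambiguous: everything is "Paris"
by_uri = {r["uri"] for r in records}      # unambiguous: one entry per thing

print(len(by_label), len(by_uri))
```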
Application Programming Interface
Build on top of data
Google Maps, Twitter, Facebook, Digg,
Guardian, NYT, NPR, They Work For
You, etc.
API
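A rough sketch of what building on top of an API involves: construct a request URL, then pick what is needed out of the structured response. The endpoint, parameter names and JSON layout are all assumptions for illustration and do not describe any of the services named above.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://api.example.org/search"  # hypothetical endpoint


def build_request_url(query, page=1):
    """Build the URL a client would request (parameter names are assumed)."""
    return f"{BASE_URL}?{urlencode({'q': query, 'page': page})}"


# Made-up response; a real one would come back from the request above
SAMPLE_RESPONSE = """
{"results": [
    {"title": "Budget 2010", "section": "politics"},
    {"title": "Cup final", "section": "sport"}
]}
"""


def titles_in_section(raw_json, section):
    """Return the titles of all results filed under one section."""
    data = json.loads(raw_json)
    return [item["title"] for item in data["results"] if item["section"] == section]


print(build_request_url("data journalism"))
print(titles_in_section(SAMPLE_RESPONSE, "politics"))
```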
Slideshare.net/onlinejournalist
Twitter.com/paulbradshaw
Q&A
Delicious.com/paulb/datajournalism
Delicious.com/paulb/visualisation
Delicious.com/paulb/statistics
Bookmarks

New information for new journalists pt2: data