“The Tribune’s biggest magnet by far has been its more than three dozen interactive databases, which collectively have drawn three times as many page views as the site’s stories.” http://bit.ly/dj2dmz
Humans collect data
Humans enter data
Human error
Time spent now...
- Different words for the same thing
- Double spaces, punctuation
- Wrong data type
- Mistyped
- Duplicate entries
- Default entries (1/1/00)
...Saves time later
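A minimal sketch of how a few of these problems might be flagged automatically. The rows, the column names and the `problems` helper are all hypothetical, invented for illustration; the set of "default" dates is an assumption based on the 1/1/00 example.

```python
import re

# Hypothetical rows, invented for illustration only.
rows = [
    {"name": "Acme Ltd", "amount": "1200", "date": "14/03/09"},
    {"name": "Acme  Ltd.", "amount": "1,200", "date": "1/1/00"},
    {"name": "Bolt plc", "amount": "n/a", "date": "02/11/09"},
]

# Assumed placeholder values that a data-entry form might fill in by default.
DEFAULT_DATES = {"1/1/00", "01/01/00"}


def problems(row):
    """Return a list of likely data-entry problems found in one row."""
    found = []
    if re.search(r"\s{2,}", row["name"]):
        found.append("double space")
    if not row["amount"].isdigit():
        # Text such as "1,200" or "n/a" sitting in a numeric column.
        found.append("amount is not a plain number")
    if row["date"] in DEFAULT_DATES:
        found.append("default date")
    return found


# Map each row number to the problems found in it.
report = {i: problems(row) for i, row in enumerate(rows)}
for i, found in report.items():
    print(i, found)
```

A report like this only points at rows worth a second look; deciding what the correct value should be is still a human job.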
"Because we take the time to clean the data, we are able to do lobbying stories no other news organisation can do." David Donald, Center for Public Integrity
Cleaning methods
- Group by term, then sort to see duplications
- Find & replace double spaces, etc.
- Select column/row & check data type
- Sort to find unusually large/small values, and neighbouring misspellings
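The same cleaning methods can be sketched in a few lines of Python rather than a spreadsheet. This is only an illustration under stated assumptions: the company names, the amounts and the `normalise` helper are hypothetical, not taken from any real dataset.

```python
import re
from collections import Counter

# Hypothetical values from one spreadsheet column, invented for illustration.
names = ["Smith & Co", "Smith  & Co", "smith & co.",
         "Jones Ltd", "Jonse Ltd", "Jones Ltd"]
amounts = [120, 95, 110, 12000, 0]


def normalise(value):
    """Find & replace: collapse repeated whitespace, drop a trailing full stop, lowercase."""
    value = re.sub(r"\s+", " ", value).strip()
    return value.rstrip(".").lower()


# Group by term, then sort: variants of one name collapse into a single
# count, and near-identical spellings end up next to each other.
counts = Counter(normalise(n) for n in names)
by_term = sorted(counts.items())

# Check the data type of a column: every amount should be an integer.
all_numeric = all(isinstance(a, int) for a in amounts)

# Sort to find unusually large or small values at either end.
sorted_amounts = sorted(amounts)
smallest, largest = sorted_amounts[0], sorted_amounts[-1]

for term, n in by_term:
    print(n, term)
print("smallest:", smallest, "largest:", largest)
```

In the sorted output "jones ltd" and "jonse ltd" sit side by side, which is how a neighbouring misspelling shows up; whether they are the same company still needs checking by hand.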
Never publish a name from data without running a background check. Check.
Other tools
- Freebase Gridworks: see http://vimeo.com/10081183
Combining data sources
- Geocoded data with map
  - Live data (e.g. Twitter API)
  - Static data (e.g. Google Docs)
  - Dynamic data (e.g. Google Form)
- Two spreadsheets with common data
  - Tools: MySQL, Access, etc.
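A rough sketch of combining two spreadsheets that share a common column. It uses Python's built-in sqlite3 module as a stand-in for MySQL or Access, which is an assumption made so the example runs without a database server. The table names, column names and figures are hypothetical.

```python
import sqlite3

# An in-memory database stands in for MySQL or Access in this sketch.
conn = sqlite3.connect(":memory:")

# Two hypothetical "spreadsheets" that share a company column.
conn.execute("CREATE TABLE donations (company TEXT, amount INTEGER)")
conn.execute("CREATE TABLE contracts (company TEXT, contract_value INTEGER)")
conn.executemany(
    "INSERT INTO donations VALUES (?, ?)",
    [("Acme Ltd", 5000), ("Bolt plc", 1200)],
)
conn.executemany(
    "INSERT INTO contracts VALUES (?, ?)",
    [("Acme Ltd", 250000), ("Cog Inc", 90000)],
)

# Join on the common column: only companies present in both tables come back.
combined = conn.execute(
    "SELECT d.company, d.amount, c.contract_value "
    "FROM donations AS d "
    "JOIN contracts AS c ON d.company = c.company "
    "ORDER BY d.company"
).fetchall()
conn.close()

for company, amount, contract_value in combined:
    print(company, amount, contract_value)
```

The join matches on exact text, so it only works once the common column has been cleaned; "Acme Ltd" and "Acme  Ltd." would not match.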