Cleaning data with Google Refine


Presentation at the Global Investigative Journalism Conference, Kiev, Ukraine, 15 Oct 2011

Published in: Education, Technology, Travel

  • 1. Get yourself ready • Google ‘Google Refine download’ refine/wiki/Downloads • Download and install Google Refine • Download data at • Open it up - it should open in a browser at, 15 October 2011
  • 2. Google Refine: cleaning data • Paul Bradshaw, 15 October 2011
  • 3. In a nutshell... • Getting rid of common data problems • ‘Clustering’ data to clean up multiple names for the same thing • Manual tidying
  • 4. The basics • Common transforms
  • 5. What can you do with Google Refine? • Clean common data problems: wrong format, inconsistent case, HTML, spaces, etc. • Use algorithms to find similar items • Use APIs and GREL to add new data
  • 6. David Donald, Center for Public Integrity: "Because we take the time to clean the data, we are able to do lobbying stories no other news organisation can do."
  • 7. Time spent now... • Humans collect data • Humans enter data • Human error
  • 8. ...Saves time later • Different words for the same thing • Double spaces, punctuation • Wrong data type • Mistyped • Duplicate entries • Default entries (1/1/00)
  • 9. First! • Save some copies of the raw data • Work on a new copy • Save versions as you go so you can revert • Note: Docs limited to 200,000 cells/256 cols; some Excel versions limited to 65,536 rows
  • 10. Cleaning methods • Group by term to see duplications • Find & replace double spaces, etc. • Select column/row & check data type • Sort to find unusually large/small values, and neighbouring misspellings
  • 11. Check. • Never publish a name from data without running a background check
  • 12. Edit cells > Common transforms
  • 13. Facets
  • 14. Facets, Edit cells • Edit cells > common transforms > cluster & edit > unescape HTML • Edit cells > split multi-valued cells • Facet > text facet • Export...
  • 15. Clustering • An intelligent helper
  • 16. Algorithms • Fingerprint: looks for items with identical characters, e.g. “John Smith,” and “Smith, John” • Double-metaphone: looks for similar sounds, e.g. “Horowitz” and “Horowicz” • PPM: partial matches - try increasing radius to increase
  • 17. Algorithms • Nearest neighbor: looks for shared clusters of characters, e.g. “Johnson” and “Johnsons” • Levenshtein: looks for number of edits needed to change one to another, e.g. “New York” -> “newyork” = 3 edits
  • 18. Just a helper... • Check and tick to apply the cleanup - click ‘Browse this cluster’ to see in more detail. • Research to check if there are 2 people with the same name • Will not spot abbreviations, e.g. MOJ vs Ministry of Justice
  • 20. Links • google-refine
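
The fingerprint and Levenshtein ideas from slides 16 and 17 can be sketched outside Google Refine in a few lines of Python. This is a simplified illustration, not Google Refine's own implementation: the function names `fingerprint`, `cluster_by_fingerprint` and `levenshtein` are hypothetical, and the normalisation steps (fold accents, lowercase, drop punctuation, sort unique words) are assumptions about what a fingerprint-style key typically does.

```python
import re
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Simplified fingerprint key: ignores case, punctuation and word order."""
    # Fold accented characters to their closest ASCII equivalents.
    ascii_text = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Lowercase, then replace every punctuation character with a space.
    cleaned = re.sub(f"[{re.escape(string.punctuation)}]", " ", ascii_text.lower())
    # split() also swallows double spaces; sorting the unique tokens
    # makes "John Smith" and "Smith, John" produce the same key.
    return " ".join(sorted(set(cleaned.split())))


def cluster_by_fingerprint(values):
    """Group values that share a key; return only groups worth reviewing."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    # A cluster needs at least two distinct spellings.
    return [sorted(set(group)) for group in groups.values() if len(set(group)) > 1]


def levenshtein(a: str, b: str) -> int:
    """Number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    names = ["John Smith,", "Smith, John", "MOJ", "Ministry of Justice"]
    print(cluster_by_fingerprint(names))       # [['John Smith,', 'Smith, John']]
    print(levenshtein("New York", "newyork"))  # 3
```

As slide 18 warns, "MOJ" and "Ministry of Justice" do not share a key, so abbreviations still need manual tidying, and any cluster the code suggests should be checked by a person before the values are merged.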