Cleaning data with Google Refine
Presentation at the Global Investigative Journalism Conference, Kiev, Ukraine, 15 Oct 2011

Usage Rights: CC Attribution-NonCommercial License

    Presentation Transcript

    • Get yourself ready (Saturday, 15 October 2011)
        • Google ‘Google Refine download’: http://code.google.com/p/google-refine/wiki/Downloads
        • Download and install Google Refine
        • Download data at http://bit.ly/nqbIaI
        • Open it up - it should open in a browser at http://127.0.0.1:3333/
    • Google Refine: cleaning data. Paul Bradshaw, OnlineJournalismBlog.com, Twitter.com/paulbradshaw
    • In a nutshell...
        • Getting rid of common data problems
        • ‘Clustering’ data to clean up multiple names for the same thing
        • Manual tidying
    • The basics: Common transforms
    • What can you do with Google Refine?
        • Clean common data problems: wrong format, inconsistent case, HTML, spaces, etc.
        • Use algorithms to find similar items
        • Use APIs and GREL to add new data
    • David Donald, Center for Public Integrity: "Because we take the time to clean the data, we are able to do lobbying stories no other news organisation can do."
    • Time spent now...
        • Humans collect data
        • Humans enter data
        • Human error
    • ...Saves time later
        • Different words for the same thing
        • Double spaces, punctuation
        • Wrong data type
        • Mistyped
        • Duplicate entries
        • Default entries (1/1/00)
    • First!
        • Save some copies of the raw data
        • Work on a new copy
        • Save versions as you go so you can revert
        • Note: Docs limited to 200,000 cells/256 cols; some Excel versions limited to 65,536 rows
    • Cleaning methods
        • Group by term to see duplications
        • Find & replace double spaces, etc.
        • Select column/row & check data type
        • Sort to find unusually large/small, and neighbouring misspellings
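
Outside Refine, the "group by term" and "sort" checks above can be roughed out in a few lines of Python. This is only a sketch under stated assumptions: the `departments` and `amounts` columns are made-up sample data, not taken from the workshop file.

```python
from collections import Counter

# Hypothetical sample columns (not from the workshop data set).
departments = [
    "Ministry of Justice",
    "Home Office",
    "Ministry of Jutsice",   # deliberate misspelling
    "Ministry of Justice",
]
amounts = [1200, 950, 1100000, 12]

# Group by term: rare variants of a common term are candidates for typos.
term_counts = Counter(departments).most_common()

# Sort the distinct terms: misspellings tend to sit next to the correct form.
sorted_terms = sorted(set(departments))

# Sort-style check on numbers: the extremes are worth a second look.
smallest, largest = min(amounts), max(amounts)

print(term_counts)
print(sorted_terms)
print(smallest, largest)
```

The counts and the sorted list only point at suspicious values; deciding whether "Ministry of Jutsice" is really a typo is still a manual call.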
    • Check. Never publish a name from data without running a background check
    • Edit cells > Common transforms
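
For readers who want to see what this kind of transform does to a cell, here is a minimal Python sketch of three typical fixes: unescaping HTML entities, trimming and collapsing whitespace, and normalising case. It is an illustration only, not Refine's own implementation; the function names `trim_and_collapse` and `clean_cell` are hypothetical.

```python
import html
import re


def trim_and_collapse(value: str) -> str:
    """Replace every run of whitespace with one space and trim both ends."""
    return re.sub(r"\s+", " ", value).strip()


def clean_cell(value: str) -> str:
    """Unescape HTML entities, tidy whitespace, then convert to title case."""
    value = html.unescape(value)       # "&amp;" becomes "&"
    value = trim_and_collapse(value)   # "a   b " becomes "a b"
    return value.title()               # "smith & sons" becomes "Smith & Sons"


print(clean_cell("  smith &amp; sons   ltd "))
```

Unescaping comes first so that entities such as `&nbsp;` turn into whitespace before the whitespace step runs. Title case is a blunt tool for names with apostrophes or particles, so it is worth checking the result by eye.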
    • Facets
    • Facets, Edit cells
        • Edit cells > common transforms > cluster & edit > unescape HTML
        • Edit cells > split multi-valued cells
        • Facet > text facet
        • Export...
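
The idea behind splitting multi-valued cells and then running a text facet can be sketched in Python as follows. The row layout, the column names and the `;` separator are assumptions made for the example. The sketch also copies the other columns onto each new row, which is a simplification and may differ from how Refine lays out the extra rows.

```python
from collections import Counter


def split_multi_valued(rows, column, separator):
    """Return one row per value found in `column`, splitting on `separator`."""
    result = []
    for row in rows:
        for part in row[column].split(separator):
            new_row = dict(row)            # copy so the input is left untouched
            new_row[column] = part.strip()
            result.append(new_row)
    return result


def text_facet(rows, column):
    """Count how many rows carry each distinct value in `column`."""
    return Counter(row[column] for row in rows)


# Hypothetical sample rows.
rows = [
    {"company": "Acme", "lobbyists": "J. Smith; A. Jones"},
    {"company": "Globex", "lobbyists": "J. Smith"},
]

split_rows = split_multi_valued(rows, "lobbyists", ";")
facet = text_facet(split_rows, "lobbyists")
print(facet.most_common())
```

Once each value sits on its own row, the facet counts show at a glance which names recur across the data.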
    • Clustering: An intelligent helper
    • Algorithms
        • Fingerprint: looks for items made up of the same words once case, punctuation and word order are ignored, e.g. “John Smith,” and “Smith, John”
        • Double-metaphone: looks for similar sounds, e.g. “Horowitz” and “Horowicz”
        • PPM: partial matches - try increasing radius to increase
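
To make the fingerprint idea concrete, here is a minimal Python sketch of a fingerprint-style key: lowercase the text, strip accents and punctuation, then sort and de-duplicate the words. It follows the general approach described above rather than reproducing Refine's exact code, and the names `fingerprint` and `cluster` are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a key that ignores case, accents, punctuation and word order."""
    value = value.strip().lower()
    # Decompose accented letters and drop anything that is not plain ASCII.
    value = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    # Remove punctuation, keeping letters, digits, underscores and whitespace.
    value = re.sub(r"[^\w\s]", "", value)
    # Sort the unique words so that word order no longer matters.
    return " ".join(sorted(set(value.split())))


def cluster(values):
    """Group values sharing a fingerprint; keep groups with more than one spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


print(cluster(["John Smith,", "Smith, John", "Jane Doe"]))
```

Each cluster is a suggestion to review, not a guaranteed match: two different people can share the same fingerprint.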
    • Algorithms
        • Nearest neighbor: looks for shared clusters of characters, e.g. “Johnson” and “Johnsons”
        • Levenshtein: looks for number of edits needed to change one to another, e.g. “New York” -> “newyork” = 3 edits
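
The Levenshtein count on this slide can be checked with a short Python sketch of the standard dynamic-programming algorithm. This is a textbook version written for illustration, not Refine's code. The comparison is case-sensitive, which is why “New York” -> “newyork” costs 3 edits (two case changes and one deleted space).

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))   # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("New York", "newyork"))   # 3
print(levenshtein("Johnson", "Johnsons"))   # 1
```

A small distance relative to the length of the strings suggests two values may be variants of the same name.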
    • Just a helper...
        • Check and tick to apply the cleanup - click ‘Browse this cluster’ to see in more detail.
        • Research to check if there are 2 people with the same name
        • Will not spot abbreviations, e.g. MOJ vs Ministry of Justice
    • Links
        • Delicious.com/paulb/kiev11
        • Delicious.com/paulb/googlerefine
        • OnlineJournalismBlog.com/tag/google-refine