
Cleaning data with Google Refine

Presentation at the Global Investigative Journalism Conference, Kiev, Ukraine, 15 Oct 2011

Published in: Education, Technology, Travel
Transcript of "Cleaning data with Google Refine"

  1. Get yourself ready
     • Google ‘Google Refine download’ http://code.google.com/p/google-refine/wiki/Downloads
     • Download and install Google Refine
     • Download data at http://bit.ly/nqbIaI
     • Open it up - it should open in a browser at http://127.0.0.1:3333/
     Saturday, 15 October 2011
  2. Google Refine: cleaning data
     Paul Bradshaw, OnlineJournalismBlog.com, Twitter.com/paulbradshaw
  3. In a nutshell...
     • Getting rid of common data problems
     • ‘Clustering’ data to clean up multiple names for same thing
     • Manual tidying
  4. The basics: Common transforms
  5. What can you do with Google Refine?
     • Clean common data problems: wrong format, inconsistent case, HTML, spaces, etc.
     • Use algorithms to find similar items
     • Use APIs and GREL to add new data
  6. David Donald, Center for Public Integrity: "Because we take the time to clean the data, we are able to do lobbying stories no other news organisation can do."
  7. Time spent now...
     • Humans collect data
     • Humans enter data
     • Human error
  8. ...Saves time later
     • Different words for the same thing
     • Double spaces, punctuation
     • Wrong data type
     • Mistyped
     • Duplicate entries
     • Default entries (1/1/00)
  9. First!
     • Save some copies of the raw data
     • Work on a new copy
     • Save versions as you go to revert
     • Note: Docs limited to 200,000 cells/256 cols; some Excel limited to 66,000 rows
  10. Cleaning methods
     • Group by term to see duplications
     • Find & replace double spaces, etc.
     • Select column/row & check data type
     • Sort to find unusually large/small, and neighbouring misspellings
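
The "group by term" and "find & replace double spaces" steps on the Cleaning methods slide can be sketched outside Refine as well. The Python below is only an illustration under assumptions: the sample values and the helper name `collapse_spaces` are made up for this example and are not from the workshop data.

```python
import re
from collections import Counter

# Hypothetical sample column (illustrative values, not the workshop dataset).
names = [
    "Ministry of Justice",
    "Ministry  of Justice",   # double space
    "ministry of justice ",   # different case, trailing space
    "Home Office",
]


def collapse_spaces(value: str) -> str:
    """Trim both ends and replace runs of whitespace with a single space."""
    return re.sub(r"\s+", " ", value).strip()


# Group by term: count each distinct value, roughly what a text facet shows.
raw_counts = Counter(names)

# Same grouping after collapsing spaces and ignoring case.
clean_counts = Counter(collapse_spaces(n).lower() for n in names)

print(len(raw_counts), "distinct values before cleaning")
print(len(clean_counts), "distinct values after cleaning")
for term, count in sorted(clean_counts.items()):
    print(count, term)
```

Sorting the grouped terms alphabetically, as the last loop does, is one way to make neighbouring misspellings sit next to each other.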
  11. Check.
     Never publish a name from data without running a background check
  12. Edit cells > Common transforms
  13. Facets
  14. Facets, Edit cells
     • Edit cells > common transforms
       > cluster & edit
       > unescape HTML
     • Edit cells > split multi-valued cells
     • Facet > text facet
     • Export...
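
To make "unescape HTML" and "split multi-valued cells" concrete, here is a rough Python equivalent of what those two menu items do to a single cell. It is a sketch, not Refine's own code: the function names, the `;` separator and the sample cell are assumptions chosen for illustration.

```python
import html
from typing import List


def unescape_html(value: str) -> str:
    """Turn HTML entities such as &amp; back into plain characters."""
    return html.unescape(value)


def split_multi_valued(value: str, separator: str = ";") -> List[str]:
    """Split one cell into several values, dropping empty pieces."""
    return [part.strip() for part in value.split(separator) if part.strip()]


# Hypothetical cell holding two organisation names and some HTML entities.
cell = "Smith &amp; Jones Ltd; Acme &quot;Holdings&quot; ;"

# Unescape first: entities themselves end in ';', so splitting first
# would cut '&amp;' in half.
values = split_multi_valued(unescape_html(cell))
print(values)
```

The order matters here for the reason given in the comment, which is worth keeping in mind when chaining the same two transforms by hand.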
  15. Clustering: An intelligent helper
  16. Algorithms
     • Fingerprint: looks for items with identical characters, e.g. “John Smith,” and “Smith, John”
     • Double-metaphone: looks for similar sounds, e.g. “Horowitz” and “Horowicz”
     • PPM: partial matches - try increasing radius to increase
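
The fingerprint idea can be approximated in a few lines of Python. This is a simplified sketch of key-collision clustering, assuming the usual steps (trim, lowercase, strip punctuation and accents, then sort and de-duplicate the words); it is not Refine's actual implementation, and the function name and sample values are invented for the example.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a normalised key so that reordered or re-punctuated names collide."""
    text = value.strip().lower()
    # Drop accents by decomposing characters and discarding non-ASCII marks.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Remove ASCII punctuation such as commas and full stops.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Sort the unique words so word order no longer matters.
    return " ".join(sorted(set(text.split())))


# Hypothetical values; the first two are the slide's own example.
values = ["John Smith,", "Smith, John", "john  smith", "Jane Doe"]

clusters = defaultdict(list)
for v in values:
    clusters[fingerprint(v)].append(v)

# Only keys shared by more than one value are candidate clusters.
for key, members in clusters.items():
    if len(members) > 1:
        print(key, "->", members)
```

Values that share a key are only candidates for merging; as the later slide says, someone still has to check whether they really are the same person.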
  17. Algorithms
     • Nearest neighbor: looks for shared clusters of characters, e.g. “Johnson” and “Johnsons”
     • Levenshtein: looks for number of edits needed to change one to another, e.g. “New York” -> “newyork” = 3 edits
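
The edit count on the Levenshtein slide can be checked with the standard dynamic-programming version of the algorithm. The Python below is a textbook sketch written for this transcript (the function name is my own), not code taken from Refine.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions and substitutions
    needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into an empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("New York", "newyork"))   # the slide's example
print(levenshtein("Johnson", "Johnsons"))
```

For “New York” and “newyork” the three edits are N to n, Y to y, and removing the space, which matches the figure on the slide. Note that the comparison is case-sensitive, so lowercasing values first reduces the distance.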
  18. Just a helper...
     • Check and tick to apply the cleanup - click ‘Browse this cluster’ to see in more detail.
     • Research to check if there are 2 people with same name
     • Will not spot abbreviations, e.g. MOJ vs Ministry of Justice
  19. [No text on this slide]
  20. Links
     • Delicious.com/paulb/kiev11
     • Delicious.com/paulb/googlerefine
     • OnlineJournalismBlog.com/tag/google-refine