Intro to OpenRefine

Published in: Data & Analytics

  1. 1. Introduction to Data Cleaning. Presented by Milena Marin (milena.marin@okfn.org; @milena_iul)
  2. 2. Are we ready? http://openrefine.org/ bit.ly/messydata
  3. 3. What is messy data?
       ● In groups of 2 or 3, take 10 to 15 minutes to explore the data.
       ● Write down on Post-it notes the errors you find in the data: anything that makes your data “messy”.
       ● Example: numeric values appear in different formats (text, numbers, etc.).
  4. 4. Explore your data
       • How many columns and rows? Tip: use CTRL (CMD on Mac) + an arrow key to jump to the edges of your data.
       • Understand your column headers (variables).
       • What values do these variables take? Tip: apply a filter.
       • What types of data are there? Tip: numbers, text, dates, etc.
       • What are the maximum and minimum values? Tip: use sorting to order your values ascending or descending.
  5. 5. Data is messy when it contains:
       ● Spelling errors (example: the city NY is spelled N.Y. and N.Y)
       ● White space at the beginning or end of a word
       ● Dates formatted differently (example: 01/10/2013; 10.2013; October 2013; 01.10.2013 12:00:34)
       ● Numbers formatted as text (example: £100 can be a number formatted as currency or a string of text)
         ○ Hint: by default, spreadsheets align numbers to the right and text to the left
       ● Missing values
       ● 2 or more variables in the same column
  6. 6. What is OpenRefine?
       ● Open-source tool for cleaning and preparing messy data for analysis
       ● Runs locally, but in a web browser
       ● Formerly a Google product, now an open-source project
       ● I wouldn’t leave home without it!
  7. 7. Microsoft Excel vs. Open Refine (X = supported)
       Feature                     Microsoft Excel   Open Refine
       Sorting                     X                 X
       Removal of white space      X                 X
       Splitting columns           X                 X
       Convert JSON                                  X
       Text faceting                                 X
       HTTP requests                                 X
       Geocoding                                     X
       Reconciliation to API                         X
       Regex matching                                X
       Record of transformation                      X
  8. 8. Sorting
  9. 9. Remove white spaces
  10. 10. Split columns
  11. 11. Rename columns
  12. 12. Correct formats
  13. 13. Cluster
  14. 14. Cluster
  15. 15. Export
  16. 16. Clean Data! http://bit.ly/clean_data
  17. 17. Practice
        ● What are the top 5 initiatives that received the largest contributions?
        ● What about the smallest contributions?
        ● What is the average contribution?
        ● Which initiative receives the most contributions? Which receives the fewest?
        ● Which party receives the most contributions?
        ● In which cities are the Democrats receiving more contributions than the Republicans?
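
For readers who want to try the same cleaning steps outside the tool, here is a minimal Python sketch of trimming white space, converting currency text to numbers, grouping spelling variants, and then answering a few of the practice questions. Everything in it is an assumption made for illustration: the sample rows, the column names (city, party, amount), and the helper functions are hypothetical and do not come from the workshop dataset. The fingerprint helper is only loosely modeled on the key-collision idea behind the Cluster step; it is not OpenRefine's own implementation.

```python
import re
import string
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows; the real workshop data has its own columns and values.
ROWS = [
    {"city": " N.Y. ", "party": "Democrat", "amount": "$1,200"},
    {"city": "NY", "party": "democrat ", "amount": "800"},
    {"city": "N.Y", "party": "Republican", "amount": "$500.50"},
    {"city": "Boston", "party": "REPUBLICAN", "amount": ""},
]


def fingerprint(value):
    """Build a grouping key: trim, lowercase, drop punctuation, sort unique tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def parse_amount(text):
    """Turn '$1,200' or '800' into a float; return None for missing values."""
    digits = re.sub(r"[^0-9.]", "", text)
    try:
        return float(digits)
    except ValueError:
        return None


# Clean every row: spelling variants collapse to one key, amounts become numbers.
cleaned_rows = [
    {
        "city": fingerprint(row["city"]),
        "party": fingerprint(row["party"]),
        "amount": parse_amount(row["amount"]),
    }
    for row in ROWS
]

# Average contribution, ignoring rows with a missing amount.
amounts = [r["amount"] for r in cleaned_rows if r["amount"] is not None]
average_contribution = mean(amounts)

# Total contributions per party and per (city, party) pair.
party_totals = defaultdict(float)
by_city = defaultdict(lambda: defaultdict(float))
for r in cleaned_rows:
    if r["amount"] is not None:
        party_totals[r["party"]] += r["amount"]
        by_city[r["city"]][r["party"]] += r["amount"]

# Cities where Democrats received more than Republicans.
democrat_led = sorted(
    city for city, totals in by_city.items()
    if totals["democrat"] > totals["republican"]
)

print(average_contribution, dict(party_totals), democrat_led)
```

Note that the three spellings of New York only group together after cleaning; on the raw values they would be counted as three different cities, which is why the cleaning steps come before the analysis.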
