INTRODUCTION TO
DATA
CLEANING
PRESENTED BY
MILENA MARIN
milena.marin@okfn.org; @milena_iul
Are we ready?
http://openrefine.org/
bit.ly/messydata
What is messy data?
● In groups of 2 or 3, take 10-15 minutes to explore the data.
● On Post-it notes, write down the errors you find in the data: anything
that makes your data “messy”.
● Example: numeric values appear in different formats (text, numbers,
etc.)
Explore your data
• How many columns/rows?
Tip: use CTRL (CMD on Mac) + an arrow key to jump to the
edges of your data
• Understand your column headers (variables)
• What values do these variables take?
Tip: Apply a filter
• What types of data?
Tip: Numbers, text, date, etc.
• Maximum and minimum values
Tip: Use sorting to order your values ascending or descending
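
The same exploration can be scripted. A minimal sketch in Python (standard library only), assuming a small made-up CSV; the column names `city`, `party` and `amount` are hypothetical and not taken from the workshop file:

```python
import csv
import io

# Hypothetical sample standing in for the workshop file.
SAMPLE = """city,party,amount
New York,Democrat,100
Boston,Republican,250
Chicago,Democrat,75
"""

def explore(text):
    """Return basic facts about a CSV: shape, headers, distinct values, min/max."""
    rows = list(csv.DictReader(io.StringIO(text)))
    headers = list(rows[0].keys()) if rows else []
    summary = {"rows": len(rows), "columns": len(headers), "headers": headers}
    # Distinct values per column: roughly what a filter dropdown shows.
    summary["values"] = {h: sorted({r[h] for r in rows}) for h in headers}
    # Min and max of the numeric column (assumed here to be called "amount").
    amounts = [float(r["amount"]) for r in rows]
    summary["min"], summary["max"] = min(amounts), max(amounts)
    return summary

info = explore(SAMPLE)
print(info["rows"], info["columns"], info["headers"])
print(info["values"]["party"], info["min"], info["max"])
```

Swap `SAMPLE` for the contents of your own file to get the same overview.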
Data is messy when…
● There are spelling errors (example: the city NY is also spelled N.Y. and N.Y)
● There is white space at the beginning or end of a word
● Dates are formatted differently (example: 01/10/2013; 10.2013; October 2013;
01.10.2013 12:00:34)
● Numbers are formatted as text (example: £100 can be a number formatted as
currency or a string of text)
○ Hint: by default, spreadsheets align numbers to the right and text to
the left
● Values are missing
● 2 or more variables share the same column
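
Several of these problems can be repaired with small helper functions. A rough sketch in Python, not a complete cleaner: the function names are made up for illustration, and reading dates as day-first (01/10/2013 as 1 October) is an assumption you should check against your own data.

```python
import re
from datetime import datetime

def trim(value):
    """Collapse runs of white space and strip it from both ends."""
    return re.sub(r"\s+", " ", value).strip()

def parse_amount(value):
    """Turn text such as '£100' into a float; return None if no number is left."""
    cleaned = re.sub(r"[^\d.\-]", "", value)
    try:
        return float(cleaned)
    except ValueError:
        return None

# Formats taken from the slide's examples; day-first order is an assumption.
DATE_FORMATS = ["%d/%m/%Y", "%d.%m.%Y %H:%M:%S", "%d.%m.%Y", "%m.%Y", "%B %Y"]

def parse_date(value):
    """Try each known format in turn; return None if none of them fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(trim(value), fmt)
        except ValueError:
            continue
    return None

print(trim("  New   York "))
print(parse_amount("£100"), parse_amount("n/a"))
for raw in ["01/10/2013", "10.2013", "October 2013", "01.10.2013 12:00:34"]:
    print(raw, "->", parse_date(raw))
```

Values that come back as `None` are the ones to inspect by hand; they are either missing or in a format the sketch does not know about.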
What is OpenRefine?
● Open-source tool for cleaning and preparing messy data for analysis
● Runs locally, but in a web browser
● Formerly a Google product, now an open-source project
● I wouldn’t leave home without it!
Feature                     Microsoft Excel    OpenRefine
Sorting                     X                  X
Removal of white space      X                  X
Splitting columns           X                  X
Convert JSON                                   X
Text faceting                                  X
HTTP requests                                  X
Geocoding                                      X
Reconciliation to API                          X
Regex matching                                 X
Record of transformation                       X
Sorting
Remove white spaces
Split columns
Rename columns
Correct formats
Cluster
Export
Clean Data!
http://bit.ly/clean_data
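
To get a feel for what the Cluster step does, here is a simplified sketch of key-collision clustering in Python. It is loosely modelled on the "fingerprint" idea (trim, lowercase, drop punctuation, sort the unique words) and is not OpenRefine's actual implementation; the function names are made up for illustration.

```python
import string
from collections import defaultdict

def fingerprint(value):
    """Reduce a string to a key so that near-identical spellings collide."""
    value = value.strip().lower()
    # Drop ASCII punctuation, so "N.Y." and "NY" end up the same.
    value = value.translate(str.maketrans("", "", string.punctuation))
    # Sort and de-duplicate the words, so word order does not matter.
    tokens = sorted(set(value.split()))
    return " ".join(tokens)

def cluster(values):
    """Group values that share a fingerprint; keep only groups with variants."""
    groups = defaultdict(set)
    for v in values:
        groups[fingerprint(v)].add(v)
    return [sorted(g) for g in groups.values() if len(g) > 1]

cities = ["NY", "N.Y.", "N.Y", " ny ", "Boston", "York New", "New York"]
for group in cluster(cities):
    print(group)
```

Each printed group is a set of spellings that probably mean the same thing; you still decide which spelling to keep.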
Practice
● What are the top 5 initiatives that received the largest contributions?
● What about the smallest contributions?
● What is the average contribution?
● Which initiative receives the most contributions? What about the
fewest?
● Which party receives the most contributions?
● In which cities are the Democrats receiving more contributions than the
Republicans?
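
Once the data is clean, questions like these come down to grouping and summing. A minimal Python sketch, assuming made-up rows with hypothetical columns `initiative`, `party`, `city` and `amount`; the real practice file may be structured differently.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows; replace with the cleaned practice data.
ROWS = [
    {"initiative": "A", "party": "Democrat", "city": "Boston", "amount": 100.0},
    {"initiative": "B", "party": "Republican", "city": "Boston", "amount": 250.0},
    {"initiative": "A", "party": "Democrat", "city": "Chicago", "amount": 75.0},
    {"initiative": "C", "party": "Republican", "city": "Chicago", "amount": 50.0},
]

def total_by(rows, key):
    """Sum the amounts for each distinct value of `key`."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[key]] += r["amount"]
    return dict(totals)

def top_n(totals, n=5):
    """Return the n largest (name, total) pairs, largest first."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

def cities_where_leads(rows, party, rival):
    """List cities where `party` received more in total than `rival`."""
    totals = defaultdict(lambda: defaultdict(float))
    for r in rows:
        totals[r["city"]][r["party"]] += r["amount"]
    return sorted(c for c, p in totals.items() if p[party] > p[rival])

by_initiative = total_by(ROWS, "initiative")
by_party = total_by(ROWS, "party")
average = mean(r["amount"] for r in ROWS)

print(top_n(by_initiative))
print(average)
print(top_n(by_party, n=1))
print(cities_where_leads(ROWS, "Democrat", "Republican"))
```

For the smallest contributions, sort the same totals without `reverse=True`.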
