Are we ready?
What is messy data?
● In groups of 2 or 3, take 10-15 minutes to explore the data.
● On Post-it notes, write down any errors you find in the data: anything that
makes your data “messy”
● Example: Numeric values appear in different formats: text, numbers,
Explore your data
• How many columns/rows?
Tip: use CTRL (CMD on Mac) + a cursor key (the arrow keys) to jump to the
edges of your data
• Understand your column headers (variables)
• What values do these variables take?
Tip: Apply a filter
• What types of data?
Tip: Numbers, text, date, etc.
• Maximum and minimum values
Tip: Use sorting to order your values ascending or descending
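For anyone who prefers a script to a spreadsheet, the same first look at the data can be sketched in Python. This is only a minimal sketch using the standard library; the sample rows and the column names `city`, `party` and `amount` are made up for illustration and are not taken from the workshop dataset.

```python
import csv
import io

# Hypothetical sample standing in for the workshop spreadsheet.
SAMPLE = """city,party,amount
NY,Democrat,100
N.Y.,Republican,250
Boston,Democrat,75
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
columns = list(rows[0].keys())

# How many columns / rows?
n_rows, n_cols = len(rows), len(columns)

# What values does a variable take? (the scripted version of a filter)
cities = sorted({row["city"] for row in rows})

# Maximum and minimum values (the scripted version of sorting)
amounts = [float(row["amount"]) for row in rows]
smallest, largest = min(amounts), max(amounts)

print(n_rows, n_cols, cities, smallest, largest)
```

Listing the distinct values of `city` already hints at a problem: the same city shows up under more than one spelling.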
Data is messy when it has…
● Spelling errors or inconsistent spellings (example: the city NY also appears as N.Y. and N.Y)
● White space at the beginning or end of a value
● Dates formatted differently (example: 01/10/2013; 10.2013; October 2013;
● Numbers formatted as text (example: £100 can be a number formatted as
currency or a string of text)
○ Hint: by default, numbers are aligned to the right and text is aligned to the left
● Missing values
● 2 or more variables in the same column
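Several of these problems can be illustrated with a few lines of Python. The sketch below is an assumption-laden example, not a general-purpose cleaner: the helper names `clean_text` and `parse_amount` are hypothetical, and it assumes a period is the decimal separator and a comma is a thousands separator.

```python
import re

def clean_text(value):
    """Trim leading/trailing white space and collapse inner runs of white space."""
    return re.sub(r"\s+", " ", value).strip()

def parse_amount(value):
    """Turn text such as '£1,000.50' into a float.

    Returns None when the value is missing or cannot be read as a number.
    Assumes '.' is the decimal separator and ',' a thousands separator.
    """
    text = clean_text(value)
    if not text:
        return None  # missing value
    digits = re.sub(r"[^\d.\-]", "", text)  # drop currency symbols, commas, letters
    try:
        return float(digits)
    except ValueError:
        return None

raw = ["£100", " 250 ", "", "1,000.50", "n/a"]
cleaned = [parse_amount(v) for v in raw]
print(cleaned)
```

Returning `None` for both empty and unreadable cells keeps missing values visible instead of silently turning them into zero.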
What is OpenRefine?
● Open-source tool for cleaning and preparing messy data for analysis
● Runs locally, with a web browser as its interface
● Formerly a Google product, now an open-source project
● I wouldn’t leave home without it!
Feature                    Microsoft Excel    OpenRefine
Sorting                    X                  X
Removal of white space     X                  X
Splitting columns          X                  X
Convert JSON                                  X
Text faceting                                 X
HTTP requests                                 X
Reconciliation to API                         X
Regex matching                                X
Record of transformation                      X
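To give a feel for two rows of the comparison, text faceting and regex matching, here is a rough Python analogy. It is not how the tool itself is implemented: a text facet is approximated here as a count of each distinct value, and the city values are invented for the example.

```python
import re
from collections import Counter

# Hypothetical column of city names with inconsistent spellings.
cities = ["NY", "N.Y.", "N.Y", "Boston", "NY ", "Boston"]

# A text facet: each distinct value with the number of rows it appears in.
facet = Counter(cities)

# Regex clean-up: remove periods, trim white space, then facet again.
normalized = [re.sub(r"\.", "", c).strip() for c in cities]
facet_after = Counter(normalized)

print(facet)
print(facet_after)
```

Before the clean-up the facet shows five distinct values; afterwards only two remain, which is the kind of quick check a facet is useful for.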
● What are the top 5 initiatives that received the largest contributions?
● What about the smallest contributions?
● What is the average contribution?
● Which initiative receives the most contributions? What about least
● Which party receives the most contributions?
● In which cities are the Democrats receiving more contributions than the
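Once the data is clean, questions like these reduce to sorting, averaging and grouping. The following is a sketch under stated assumptions: the records are invented, the fields (initiative, party, amount) are hypothetical, and "most contributions" is read as the number of contributions per initiative and the total amount per party.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (initiative, party, amount).
contributions = [
    ("Initiative A", "Democrat", 500.0),
    ("Initiative B", "Republican", 1200.0),
    ("Initiative A", "Democrat", 50.0),
    ("Initiative C", "Republican", 300.0),
    ("Initiative B", "Democrat", 700.0),
    ("Initiative A", "Republican", 20.0),
]

# Largest and smallest contributions (five of each).
by_amount = sorted(contributions, key=lambda r: r[2], reverse=True)
top5, bottom5 = by_amount[:5], by_amount[-5:]

# Average contribution.
average = mean(r[2] for r in contributions)

# Number of contributions per initiative and total amount per party.
count_by_initiative = defaultdict(int)
total_by_party = defaultdict(float)
for initiative, party, amount in contributions:
    count_by_initiative[initiative] += 1
    total_by_party[party] += amount

most_contributions = max(count_by_initiative, key=count_by_initiative.get)
top_party = max(total_by_party, key=total_by_party.get)

print(top5[0], round(average, 2), most_contributions, top_party)
```

The answers are only as good as the cleaning: if one party or initiative appears under two spellings, its contributions are split across two groups.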