DATA
JOURNALISM
Dr. Bahareh Heravi
@Bahareh360
Week 8

Cleaning and Analysing Data
 
DATA is often ugly & MESSY
Data Profiling
Assess current state of your data.
Data Cleaning
Correct the issues you found during ‘data profiling’.
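As a rough illustration of what profiling looks for, the sketch below counts blank and distinct values per column in a small tab-separated sample. The sample data and the `profile` helper are made up for this example; they are not part of OpenRefine or of the Powerhouse dataset.

```python
import csv
import io


def profile(rows):
    """Return, per column, how many cells are blank and how many distinct values exist."""
    report = {}
    for row in rows:
        for column, value in row.items():
            stats = report.setdefault(column, {"blank": 0, "values": set()})
            text = (value or "").strip()
            if text:
                stats["values"].add(text)
            else:
                stats["blank"] += 1
    return {
        column: {"blank": stats["blank"], "distinct": len(stats["values"])}
        for column, stats in report.items()
    }


# Hypothetical sample: one blank category and one repeated record.
SAMPLE_TSV = "id\tcategory\n1\tCostume\n2\t\n3\tCostumes\n3\tCostumes\n"
rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
summary = profile(rows)
print(summary)
```

A column with fewer distinct values than rows hints at repeats, and a non-zero blank count shows where cleaning is needed.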
Exploring data
Checking data
Filtering data
Cleaning data
Reshaping data
Annotating data
Linking data
Dataset
Powerhouse Museum objects collection
Download from:
http://data.freeyourmetadata.org/powerhouse-museum/phm-collection.tsv
Open OpenRefine and load the dataset.
Sorting data
Faceting data
To select a subset of your data to work on.
To get useful insight into your data.
To apply a transformation to a subset of your data.
Types of Facets

Text facets for text
Numeric facets for numbers and dates
Predefined/customised facets
Text facets

Text facets are used for faceting text.
Examples: County or city names, TD names
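Conceptually, a text facet lists each distinct value in a column with the number of rows that contain it, and lets you select the rows for one value. A minimal Python sketch of that idea, using invented county values:

```python
from collections import Counter


def text_facet(values):
    """Return (value, count) pairs, most frequent first."""
    return Counter(values).most_common()


# Hypothetical column of county names; note the inconsistent "dublin".
counties = ["Dublin", "Cork", "Dublin", "Galway", "dublin"]
facet = text_facet(counties)

# Selecting one facet value gives the subset of rows to work on.
dublin_rows = [index for index, name in enumerate(counties) if name == "Dublin"]

print(facet)
print(dublin_rows)
```

The lowercase "dublin" appears as its own facet entry, which is how a text facet exposes inconsistent spellings.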
Numeric facets

Numeric facets are used for faceting numerical values and ranges.
Examples: Expenditure, crime rate
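A numeric facet can be thought of as a histogram plus a range selector. The sketch below groups invented expenditure figures into fixed-width bins and then keeps only the values inside a chosen range; the bin width and the numbers are assumptions for illustration.

```python
def numeric_facet(values, bin_width):
    """Count how many values fall into each bin, keyed by the bin's lower bound."""
    bins = {}
    for value in values:
        lower_bound = (value // bin_width) * bin_width
        bins[lower_bound] = bins.get(lower_bound, 0) + 1
    return dict(sorted(bins.items()))


# Hypothetical expenditure figures.
expenditure = [120, 450, 480, 990, 1010, 75]
histogram = numeric_facet(expenditure, 500)

# Selecting a range is the numeric equivalent of clicking a text facet value.
in_range = [value for value in expenditure if 400 <= value < 1000]

print(histogram)
print(in_range)
```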
Detecting blanks
Removing blanks
Detecting duplicates
Removing duplicates
Warning:
If we remove all the duplicates now, the original records will also be removed!
Now you can remove.
Facet by blank
Congratulations, you have removed all blank and duplicate values.
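The logic behind the blank and duplicate steps can be sketched in Python: drop rows whose key is blank, and for repeated keys keep the first occurrence so the original record survives. This is an analogue of the idea, not what OpenRefine runs internally, and the `id` and `name` columns are hypothetical.

```python
def remove_blanks_and_duplicates(rows, key):
    """Keep the first row for each non-blank key value; drop blanks and later repeats."""
    seen = set()
    kept = []
    for row in rows:
        value = (row.get(key) or "").strip()
        if not value:
            continue  # blank key: remove the row
        if value in seen:
            continue  # later duplicate: remove, the original is already kept
        seen.add(value)
        kept.append(row)
    return kept


# Hypothetical records with one blank id and one repeated id.
records = [
    {"id": "7", "name": "Plough"},
    {"id": "", "name": "Unknown"},
    {"id": "7", "name": "Plough"},
    {"id": "9", "name": "Dress"},
]
cleaned = remove_blanks_and_duplicates(records, "id")
print(cleaned)
```

Removing every row whose key occurs more than once would delete both copies of id 7, which is the situation the warning describes.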
Simple cell transformations
Advanced data operations
Clustering
Transformations
Multi-valued cells
Derived columns
Splitting data across columns
Regular Expressions
GREL (General Refine Expression Language)
Multi-valued cells
To split a cell in
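A multi-valued cell holds several values joined by a separator. The sketch below splits such a cell into one row per value. The pipe separator and the `categories` column name are assumptions, and copying the other columns into each new row is a simplification chosen for this example.

```python
def split_multi_valued(rows, column, separator="|"):
    """Return one row per value found in the given multi-valued column."""
    result = []
    for row in rows:
        parts = [part.strip() for part in row[column].split(separator) if part.strip()]
        # Keep the row even when the cell is empty.
        for part in parts or [""]:
            new_row = dict(row)
            new_row[column] = part
            result.append(new_row)
    return result


# Hypothetical rows; the first has two values in one cell.
objects = [
    {"id": "1", "categories": "Costume|Clothing"},
    {"id": "2", "categories": "Agricultural equipment"},
]
split_rows = split_multi_valued(objects, "categories")
print(split_rows)
```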
Clustering
To cluster syntactically similar items together.
To be used to fix inconsistencies, typos, etc.
Examples in the dataset:
Agricultural equipment / Agricultural Equipment
Costume / Costumes
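One way to group syntactically similar values is key collision: reduce each value to a normalised key and cluster the values that share a key. The sketch below uses a simplified fingerprint-style key (lowercase, accents folded, punctuation removed, tokens sorted). It is a reduced illustration, not OpenRefine's exact implementation.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a normalised key: trimmed, lowercased, ASCII-folded, unique sorted tokens."""
    text = value.strip().lower()
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values sharing a fingerprint; return only groups with more than one spelling."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


categories = ["Agricultural equipment", "Agricultural Equipment", "Costume", "Costumes"]
clusters = cluster(categories)
print(clusters)
```

In this sketch only the capitalisation variants collide. "Costume" and "Costumes" produce different keys, so a pair like that needs a looser method, for example one based on edit distance.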
Transforming cell values
GREL (General Refine Expression Language)
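A GREL expression is applied to every cell in a column, with the cell's content available as `value` (for example `value.trim()` or `value.toTitlecase()`). As a rough Python analogue of chaining a few such transformations, assuming a messy category label as input:

```python
import re


def trim(value):
    """Remove leading and trailing whitespace."""
    return value.strip()


def collapse_whitespace(value):
    """Replace runs of whitespace with a single space."""
    return re.sub(r"\s+", " ", value)


def to_titlecase(value):
    """Capitalise the first letter of each word."""
    return value.title()


def transform(value):
    """Apply the three transformations in sequence, as one would chain them on a column."""
    return to_titlecase(collapse_whitespace(trim(value)))


# Hypothetical messy cell value.
tidy_label = transform("  agricultural   equipment ")
print(tidy_label)
```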
  
Resources
Using OpenRefine by Ruben Verborgh and Max De Wilde
http://freeyourmetadata.org/cleanup/
Cleaning Data with Refine, School of Data
The Bastard Book of Regular Expressions by Dan Nguyen
GREL: https://github.com/OpenRefine/OpenRefine/wiki/General-Refine-Expression-Language
 
Questions?
Bahareh R. Heravi
@Bahareh360

Data Journalism - Cleaning Data