"Wrangling Data With OpenRefine" We've all found datasets that hold the promise or riches inside them, but the data itself is in a terrible state. Commas and percentage signs that mess up number formats, dates that we can see are dates but the computer can't, lists of items in single cells, data spanning multiple rows where we want pivot one column's values across multiple columns, and so on. OpenRefine is a powerful browser based tool that helps you work with tabular data and tidy it up as you go along. As well as editing functions (and yes, you will get to see some regular expressions string matching action!) we can also cluster similar words and phrases, spotting that "J. Smith & Sons Ltd" is probably the same as "J Smith and Sons Limited" and being able to group them as such. Once you've seen it in action you'll wonder why working with data ever needed to be so difficult before.
45 mins x 3
Dirty in the sense that the data may contain errors; messy in that it may be in the wrong shape, form or layout. (Not messy in the statistical sense: statisticians are well known for the way they appropriate words and make them mean something no one else means. In stats, messy data is often regarded as data that is not normally distributed.)
OpenRefine as a tool helps us work with data that can be represented in tabular form, and generally just tidy it up. OpenRefine can help us with inconsistently formatted data, or data that is incorrectly represented. OpenRefine can help us get an overview of a dataset, as well as split it up in a variety of ways. And OpenRefine can help us annotate our data with data from an external source.
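As a small illustration of "incorrectly represented" data, the sketch below shows in Python the kind of repair involved when numbers arrive as text with commas and percentage signs. It is not OpenRefine code; the function name `parse_number` is hypothetical, and the sketch assumes that commas are thousands separators and that a trailing "%" can simply be dropped.

```python
def parse_number(text):
    """Convert a string such as '1,234.5' or '45%' to a float.

    Assumes commas are thousands separators and that a trailing '%' is
    decoration to be removed. Returns None if the text is not a number.
    """
    cleaned = text.strip().replace(",", "")
    if cleaned.endswith("%"):
        cleaned = cleaned[:-1].strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


if __name__ == "__main__":
    for cell in ["1,234.5", " 45% ", "n/a"]:
        print(repr(cell), "->", parse_number(cell))
```

Returning `None` rather than raising an error keeps unparseable cells visible, so they can be reviewed instead of silently dropped.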
Wrangling Data with
Computing and Communications
The Open University
“It’s … a great joy to learn a technique,
because as soon as you learn it, you
start thinking in it. When I learn a new
technique my imaginative possibilities
Playing to the Gallery