Presented by Rachel Tillay and Mike Waugh, LSU Libraries
Is your data running loose in your library? OpenRefine is a tool that can help libraries more easily view, analyze, clean, and match large data sets. It is particularly useful for digital projects, statistics, or user data. This presentation will describe how OpenRefine is different from spreadsheets, datasheets, and programming. It will also include demonstrations of some of the most useful functions in OpenRefine, compatible tools, and solutions it provides to would-be data wranglers. Examples will include real-life problems that LSU Libraries has encountered in its cataloging and digital projects.
While I learned about OpenRefine about two and a half years ago, I wasn’t exactly sure how to incorporate it into my day-to-day workflow, so it took me a while to warm up to it. As I’ve become more acquainted with it, I have found that just knowing more about what it can do makes it more and more of a go-to tool when I’m troubleshooting or working on projects (a little like MarcEdit was for me a few years ago).
This is not a step-by-step tutorial on how to use OpenRefine, although I will cover many of its features. Nor will I cover the installation. Mainly, I am going to introduce some of the tools it includes, and Rachel will show how digital services has used many of these tools on data for the Louisiana Digital Library.
David Huynh, one of the developers of what was then called Google Refine, has a short, but definitive tutorial on OpenRefine.
It’s the tool formerly known as Google Refine. In fact, it still says Google Refine in the header when you install it.
Importing and exporting are important because OpenRefine is meant to work with other tools. Use it in addition to them. For the systems librarian, this would be the ILS, MarcEdit, Excel, and web services. Faceting data means slicing it in different ways to analyze it or to group it together for transformation. Clustering groups together similar data and allows you to combine them easily. Reconciliation allows you to use outside data sets, available as web services, to transform or link your data.
These functions and examples will form the body of the rest of this presentation.
Create a project by importing data. What kinds of data files can I import? TSV, CSV, *SV, Excel (.xls and .xlsx), JSON, XML, RDF as XML, and Google Data documents are all supported. Support for other formats can be added with Google Refine extensions.
You can also open existing projects or import projects that have been worked on and exported from another OpenRefine instance.
Note the Preview above shows you what you are doing while you’re manipulating the data (a common theme in OpenRefine). With many text-based projects, I recommend unchecking “Quotation marks are used to enclose cells containing column separators.”
Logs have different types of queries. It looks like these are the ones that would be useful to look at. I don’t need the Login rows.
Here, I’m creating a new column based on the value of an existing column. This is similar to writing a formula in Excel, but the expression can use regular expressions. The language is called Google Refine Expression Language (GREL), and there is online documentation for it. The best thing is that you can see the results of the transform in real time, which aids greatly in creating the expression. I haven’t figured out how to edit the expression once created, so I need to copy it and keep it in my notes.
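To make the idea of deriving a new column from an existing one concrete, here is a small Python sketch. It is not GREL and not the expression used in the demonstration; the "url" column, the "q=" parameter, and the sample rows are all assumptions made up for illustration.

```python
import re
from urllib.parse import unquote_plus

# Hypothetical log rows; the "url" column and the "q=" parameter are assumptions.
rows = [
    {"url": "/search?q=hawaiian+guitar&page=1"},
    {"url": "/login"},
    {"url": "/search?q=jazz%20AND%20blues"},
]

# Capture everything after "q=" up to the next "&" (or the end of the string).
QUERY_PATTERN = re.compile(r"[?&]q=([^&]*)")

def extract_query(url):
    """Return the decoded search term, or an empty string when none is present."""
    match = QUERY_PATTERN.search(url)
    return unquote_plus(match.group(1)) if match else ""

# Add the derived value as a new "query" column on every row.
for row in rows:
    row["query"] = extract_query(row["url"])

queries = [row["query"] for row in rows]
print(queries)
```

Rows that do not match the pattern, like the login row, end up with an empty value, which is the kind of result a facet would then let you find and fix or remove.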
My crazy insane GREL expression was not foolproof, apparently. I can fix the expression or fix the rows manually.
Apply to all identical cells. You can also edit from the facet box.
Might decide to remove the searches with Boolean searches
The field names here are the same as the names of the columns in my original spreadsheet. The number of rows and the number of records are both 763, which equals the number of rows in my spreadsheet. In short, the data is virtually unaltered.
Many of these fields are good for faceting. Fields that are helpful for faceting include ones where a small number of terms are used. I’ve also found it helpful to facet fields in which every term should be unique, like call numbers. However, if you have fields that contain multiple terms, like Style does, you’re going to want to split the cells before faceting.
You can also use | , or any other delimiter to split the fields.
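The effect of splitting before faceting can be sketched in a few lines of Python. This only imitates what a text facet counts; the sample Style values are hypothetical, and "; " is used as the delimiter because that is what CONTENTdm uses.

```python
from collections import Counter

# Hypothetical values for a multi-valued "Style" field, delimited by "; ".
style_cells = [
    "Blues; Hawaiian Guitar",
    "Blues",
    "Holiday music; Blues",
]

def facet(values):
    """Count how many times each distinct value occurs, like a text facet."""
    return Counter(values)

def split_cells(cells, delimiter="; "):
    """Split each cell on the delimiter and return one term per row."""
    return [
        term.strip()
        for cell in cells
        for term in cell.split(delimiter)
        if term.strip()
    ]

facet_before = facet(style_cells)               # whole cells are counted
facet_after = facet(split_cells(style_cells))   # individual terms are counted

print(facet_before)
print(facet_after)
```

Before the split, every combination of terms is its own facet value; after the split, "Blues" is counted once per row it appears in, which is what makes the facet useful for checking a controlled vocabulary.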
Edit and apply will change all of the records that use this term, like an “apply all.”
Selecting a term in facet will bring up all the records with that term. In this case, this is the record (with multiple associated rows) that contains Hawaiian Guitar. This is one way for you to only edit some of the items with a particular vocabulary term, but you can also use this method and change all associated records with “Apply to All…”
Clustering is essentially a method for matching your data to itself.
Options under Method include key collision and nearest neighbor. Options under Keying Function include fingerprint, ngram-fingerprint, metaphone3, and cologne-phonetic. I recommend trying all of them, because you never know which is going to be most effective with your current data. I’ve found interesting clusters using each of them. Clustering before you reconcile is helpful because it allows you to reconcile more effectively.
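For a sense of how key collision with a fingerprint keying function groups values, here is a rough Python approximation. It follows the general recipe (trim, lowercase, strip punctuation, fold accents, sort and de-duplicate the words) rather than reproducing OpenRefine's own code, and the sample terms are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Build an approximate fingerprint key for a string."""
    text = value.strip().lower()
    # Fold accented characters to their closest ASCII equivalents.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Remove punctuation, keeping word characters and whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    # Sort and de-duplicate the words so word order does not matter.
    tokens = sorted(set(text.split()))
    return " ".join(tokens)

def cluster(values):
    """Group values whose keys collide; keep only groups with more than one variant."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]

# Hypothetical terms with inconsistent case, spacing, and word order.
terms = ["Hawaiian Guitar", "hawaiian guitar ", "Guitar, Hawaiian", "Blues"]
clusters = cluster(terms)
print(clusters)
```

The other keying functions and the nearest neighbor method use different notions of similarity, which is why each one can surface clusters the others miss.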
This is the filepath for extensions in Windows. See pg. 70 of Using OpenRefine for details about adding a reconciliation service on Mac or Linux
Select the field you want to reconcile, choose “Start reconciling,” and then select the reconciliation service you want to use for that field. As you reconcile, the green bar at the top will show you the percentage of terms that have been matched.
Terms will be easier to match if they already use the same exact language as the vocabulary you want to match (so, I changed Holiday music to Holidays—Songs and music). Using the single checkmark will match this one item; using the double will match all the identical terms. You can also search for a match and it will show you a number of related terms from the same vocabulary (in this case LCSH).
If the field contains items that don’t have a URL/couldn’t be reconciled, you may want to store the error. Then you can facet on this error and use fill down to fill the column with placeholder data.
The result is a column with working links to LCSH. Next, I recommend clearing the reconciliation data, and using edit cells>join multi-valued cells to combine the rows for your records until the number of rows and records are equal (in this case you’ll be back to the 763 you had originally).
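What joining multi-valued cells does to the row count can be sketched in Python as follows. The record identifiers and URLs are placeholders, not real LCSH links, and "; " is assumed as the separator.

```python
# Each tuple is one row after splitting: (record id, one URL per row).
# The ids and example.org URLs are hypothetical placeholders.
split_rows = [
    ("rec1", "http://example.org/a"),
    ("rec1", "http://example.org/b"),
    ("rec2", "http://example.org/c"),
]

def join_multi_valued(rows, separator="; "):
    """Combine the values of rows that share a record id into one cell per record."""
    grouped = {}
    for record_id, value in rows:
        grouped.setdefault(record_id, []).append(value)
    return {record_id: separator.join(values) for record_id, values in grouped.items()}

joined = join_multi_valued(split_rows)
print(len(split_rows), "rows before joining")
print(len(joined), "records after joining")
```

After the join there is one row per record again, which is the check described above: the row count and the record count should match.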
Renaming the .tsv as a .txt allows you to open it with any application that can open text files. My computer defaults to opening text files in Excel, so that’s what you see here. Two-thirds of the way across you can see the StyleURLS field that I created. We can now use this file to create XML, which we can edit with XSLT, and we will eventually have a MODS document for each record that includes the new linked data. We will simultaneously be able to clean and audit the data (as you’ve seen). This is going to greatly improve the quality of our data, and it is much more efficient than any other method.
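As a minimal sketch of the step from the exported tab-separated file to XML, the Python below reads TSV text and writes one generic element per record. It does not produce MODS and is not the workflow used at LSU; the "Title" column, the element names, and the sample row are assumptions, and only the StyleURLS field name comes from the export shown here.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical export: a header row and one record, tab-separated.
tsv_text = "Title\tStyleURLS\nSample record\thttp://example.org/a; http://example.org/b\n"

def tsv_to_xml(text, value_separator="; "):
    """Turn each TSV row into a <record> element with one <styleURL> per value."""
    root = ET.Element("records")
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        record = ET.SubElement(root, "record")
        ET.SubElement(record, "title").text = row["Title"]
        for url in row["StyleURLS"].split(value_separator):
            if url:
                ET.SubElement(record, "styleURL").text = url
    return root

root = tsv_to_xml(tsv_text)
xml_string = ET.tostring(root, encoding="unicode")
print(xml_string)
```

An XSLT stylesheet could then map a generic structure like this onto the MODS elements a repository expects.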
Data Wrangling with Open Refine
Rachel Tillay (Digital Projects Librarian)
Mike Waugh (Systems Librarian)
Introduction to Open Refine
What is it?; Are there other similar tools?; What can it do?
What the heck is OpenRefine?
More powerful than a spreadsheet
More interactive and visual than scripting
More exploratory/experimental/playful than a database
But, really, what is it?
Formerly known as Google Refine
Like Excel but different
It’s an Interactive Data Transformation tool (IDT)
What the heck is an Interactive Data Transformation tool?
“Interactive Data Transformation tools (IDTs) now allow for quick and
inexpensive operations on large amounts of data inside a single
integrated interface, even by domain professionals lacking in-depth
technical skills. OpenRefine is such an IDT; a tool for visualizing and
manipulating data.” (Verborgh and De Wilde, 6)
How do other tools compare to OpenRefine?
Cells are the units of interaction
Used for entering data and performing
You only see input and output data
Used for transforming data
Requires effort to design schemas
Data is hidden, unless programming is done to expose views
Columns are the primary units of interaction.
Used for exploring and transforming data.
You see the data interactively at each step, as you edit and transform it
Used for understanding, then transforming data
No schema required, like spreadsheets
Data is always visible
What can OpenRefine Do?
Import and export data
Use regular expressions to match and transform data
Cluster together similar data
Reconcile data to outside data sets
Use Case: Clean Controlled Vocabulary
Facet data in selected fields
Assess percentages that conform to your schema
Use “edit” and “apply” or “apply to all identical cells”
Number of rows and
Controlled vocabulary fields are good for faceting
Splitting multi-valued cells requires that there be a regularly occurring division between terms
CONTENTdm uses “; ”
Now rows =/= records; each
record will have one or more
Faceting Results Before and After
Facet Before Split Facet After Split
Correct Faceted Metadata Option 1
One “Vocal and Instrumental” has an
Many extra blanks from other split
Search for match
Not every service will include a term for each item in a field
Skip reconciling a term
Reconcile a term with a different service
(Create new topic)
Use Case: Export Data for Reuse
Split cells or columns to add values
Export to text files, Excel, or RDF?
Export project to share a project’s data and
Simple exports like tab-separated
Custom tabular exporter…
RDF as RDF/XML
RDF as Turtle
Export a Project
Export project to share an
Download and save
Select “Import Project” and
browse for saved file
Should show you the entire
project history under the
Bedoya, Jaclyn. (2014). Analyzing Usage Logs with OpenRefine. Accessed March 16, 2015 at http://acrl.ala.org/techconnect/?p=4253
Huynh, David. (2011). Google Refine Tutorial. Accessed March 11, 2015 at
Verborgh, R., & De Wilde, M. (2013). Using OpenRefine: The Essential OpenRefine Guide That Takes You From Data Analysis and Error Fixing to Linking Your Dataset to the Web. Birmingham, UK: Packt Publishing.