Presented by Jennifer Hecker and Elizabeth Grumbach and hosted by the Texas Digital Humanities Consortium (TXDHC), these are the slides for the TXDHC training webcast on OpenRefine, February 12th, 2015.
This document provides an overview of OpenRefine, an open source tool for working with messy data. It discusses key features of OpenRefine including importing various data formats, exploring and transforming data through functions like text filtering and regular expressions, linking data to external sources, and exporting cleaned data. The document also outlines the steps to install OpenRefine and provides a tutorial on basic and advanced data cleaning operations.
5. Cleaning up data that:
is in a simple tabular format
is inconsistently formatted
has inconsistent terminology
6. get an overview of a data set
resolve inconsistencies
split data up into more granular parts (see the example below)
match local data up to other data sets
enhance a data set with data from other sources
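For instance, splitting a multi-valued cell into more granular parts can be done with Edit cells > Split multi-valued cells, or with a GREL transform (Edit cells > Transform…). A minimal sketch, assuming the parts happen to be separated by semicolons:

  forEach(value.split(";"), v, v.trim()).join(" | ")

This splits the cell on ";", trims stray whitespace from each part, and rejoins the parts with a single consistent separator.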
14. …ask some questions about your data set:
What type of data is it & what format is it in?
What’s the size of your data set?
What question do you want to ask your data?
What do you need to do to find the answer?
15. Excel: familiarity, better for data entry, cut and paste operation, no paging to navigate
Google Spreadsheets: similar to Excel, can get external data relatively easily, easy to collaborate and share
Google Fusion Tables: if you just want to filter, easy to share
Text editor: a powerful text editor can do many things
Unix tools: more challenging to use, but quick, and some things (finding things, sorting) are easy
Writing code: most sophisticated and most to learn!
19. Retrieve data from online sources
example: use names to retrieve birth/death dates from the Virtual International Authority File (VIAF) (see the sketch below)
Match data to external data sources using extensions for RDF, DBpedia, Named-Entity Recognition (NER), etc.
…and ‘reconciliation’ services
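One common way to do the VIAF example is Edit column > Add column by fetching URLs…, with a GREL expression that builds a query URL from the name in each cell. A rough sketch, assuming VIAF's AutoSuggest endpoint (check the current VIAF API documentation for the exact URL and response format):

  "https://viaf.org/viaf/AutoSuggest?query=" + escape(value, "url")

The JSON that comes back can then be unpacked in a follow-up transform, for example value.parseJson().result[0].term; the result and term field names here are assumptions about the response shape, not something covered in the webinar.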
20. Use the ‘cross’ function to compare the contents of two Refine projects, or to share data between the two projects.
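In GREL this is the cross() function. A minimal sketch, assuming a second project named "Authority file" with a matching "name" column and a "birth_date" column to copy over (both the project and column names are hypothetical):

  cell.cross("Authority file", "name")[0].cells["birth_date"].value

cross() returns the rows of the other project whose "name" cells match the current cell; taking the first match and reading its "birth_date" cell pulls that value into the current project.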
21. TxDHC blog post on this webinar: http://www.txdhc.org/txdhc-training-webcast-materials/
The OpenRefine Wiki: https://github.com/OpenRefine/OpenRefine/wiki
OpenRefine User Documentation: https://github.com/OpenRefine/OpenRefine/wiki/Documentation-For-Users
The ‘Free your metadata’ site: http://freeyourmetadata.org
…and the accompanying book: http://book.freeyourmetadata.org
The OpenRefine mailing list and forum: http://groups.google.com/d/forum/openrefine
23. credits * acknowledgements * citations
These slides were developed by Jennifer Hecker (j.hecker@Austin.utexas.edu) and Liz Grumbach (egrumbac@tamu.edu) on behalf of University of Texas Libraries, Texas A&M's Initiative for Digital Humanities, Media and Culture, and the Texas Digital Humanities Consortium, using many resources, including the wonderful course material developed by Owen Stephens on behalf of the British Library (http://www.meanboyfriend.com/overdue_ideas/2014/11/working-with-data-using-openrefine/).
Unless otherwise stated, all images, audio or video content are separate works with their own licenses, and should not be assumed to be CC-BY in their own right. This work is licensed under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/). It is suggested that when crediting this work, you include the phrase "Developed by Liz Grumbach and Jennifer Hecker on behalf of the University of Texas, Texas A&M, and the TXDHC."
Thanks to University of Texas Libraries, Texas A&M's Initiative for Digital Humanities, and the Texas Digital Humanities Consortium for facilitating this presentation.
Editor's Notes
Howdy there everybody! Thanks for joining this inaugural webinar from the Texas Digital Humanities Consortium. We are testing out this format for ongoing consortial training use.
This session is being recorded, and you may follow along with these slides, or access the recording, slides, and supplementary materials on the Texas Digital Humanities Consortium website. Also on the website is a link to a three-question survey, and we would very much appreciate any feedback you are willing to provide. During the webinar, Liz and I are going to trade off presenting and chat-window-monitoring duties. Please be patient and cross your fingers for us!
I’m going to introduce today’s presenters very quickly. I’m Jennifer Hecker and I work at the University of Texas Libraries. I specialize in bringing my years of experience as an archivist to bear on our digital access challenges. I also work in the digital humanities space, coordinating collaborations and projects with students, faculty, and staff all over UT. I also direct the Austin Fanzine Project and do a lot of outreach and mentoring work.
Liz Grumbach works for the Initiative for Digital Humanities, Media, and Culture at Texas A&M as a Research Associate, where she supports faculty, staff, and student Digital Humanities projects and endeavors. She's also the Project Manager for the Advanced Research Consortium and 18thConnect.org, where she organizes peer review, supports the creation of digital editions, and maintains the digital records for all ARC research nodes. She's involved in the management of the Early Modern OCR Project (eMOP), which aims to teach machines how to read early modern fonts and make open source software packages available to other institutions seeking to auto generate transcriptions of large page image data sets.
An open-source tool for working with messy data.
Runs in a browser, but locally – your data don’t leave your machine.
Active development community – people creating extensions – and discussion list.
This is some of the basic stuff you might use Refine for. In a little bit, Liz is going to walk you through these functions. Refine does a lot more, too, but today we’re just going to get your feet wet. I’ll come back after the demo and talk a little bit about some of the more advanced possibilities that you can explore…
Refine lets you
Here’s a slide from a webinar I attended a couple of weeks ago. It’s an example of OpenRefine in action – here being used to normalize data as one step in the workflow of a larger metadata aggregation project. So what does it look like?
Refine lets you split out data that is in one cell into multiple cells – and vice versa.
Here are some simple examples of what we mean when we talk about “normalizing metadata”. Refine lets you easily batch edit data so that it uniformly adheres to your standards.
Here’s what text faceting looks like. It’s useful for getting an overview of your data. Here you can quickly see some inconsistencies you might want to address.
Refine also lets you do something called clustering.
– change slide –
This is my personal favorite part!
Here’s a little bigger view…
Liz will go into more detail during the demo, but basically, Refine groups together data that it thinks is similar, according to a number of factors that you can adjust, so that you can review, modify, and batch edit it. Faceting and clustering are by far the two functions I tend to use most in Refine.
A little background: In conversation, you’ll probably hear all of these names for this tool. Nobody calls it Freebase Gridworks any more, but the other three are all common. Google originally developed Refine, but then abandoned the project & it became open source, hence the name OpenRefine. Lots of folks – myself included – take the lazy approach and just call it Refine.
There are a number of tools out there that can help you manipulate data sets in a variety of ways. How do you know which is right for you? First, ask yourself some questions about your data.
Here’s a matrix that can help guide your tool selection. It’s not comprehensive; there are more tools out there for sure (and all these tools do more than the brief descriptions above would imply – for example, Google Fusion Tables can be used to geocode location information and automatically generate maps, stuff like that), but these are the most common tools, and this gives you an idea of what to expect from each of them…
Ok, now I’m going to attempt to hand over the presentation to Liz, a couple hundred miles to my East.
Ok, so now that you’re all excited about what you can do with Refine, I’m going to quickly run through some of the more advanced functions. By using regular expressions, which I’ve seen described as “wildcards on steroids”, you can more finely filter and manipulate your data.
Using those same regular expressions, Refine helps you use GREL, the General Refine Expression Language, to perform transformations on your data.
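For example, here are two typical transforms you might apply via Edit cells > Transform… (these particular expressions are illustrative and were not part of the webinar):

  value.trim().replace(/\s+/, " ")
  value.toTitlecase()

The first trims the cell and collapses internal runs of whitespace to a single space; the second normalizes capitalization, which is handy for inconsistently entered names and subject terms.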
Using various community-developed extensions, which you can easily select and install, you can retrieve data from online sources such as VIAF, and you can match data to external sources such as DBpedia.
Thanks for tuning in y’all! We hope this was helpful and we welcome any questions or feedback y’all might have!