Data Wrangling with Open Refine

Presented by Rachel Tillay and Mike Waugh, LSU Libraries
Is your data running loose in your library? OpenRefine is a tool that can help libraries more easily view, analyze, clean, and match large data sets. It is particularly useful for digital projects, statistics, or user data. This presentation will describe how OpenRefine is different from spreadsheets, databases, and programming. It will also include demonstrations of some of the most useful functions in OpenRefine, compatible tools, and solutions it provides to would-be data wranglers. Examples will include real-life problems that LSU Libraries has encountered in its cataloging and digital projects.

Published in: Technology

  1. Data Wrangling with Open Refine
      • Rachel Tillay (Digital Projects Librarian)
      • Mike Waugh (Systems Librarian)
  2. Introduction to Open Refine
      • What is it?
      • Are there other similar tools?
      • What can it do?
  3. What the heck is OpenRefine?
      • More powerful than a spreadsheet
      • More interactive and visual than scripting
      • More exploratory/experimental/playful than a database (Huynh, 2)
  4. But, really, what is it?
      • Formerly known as Google Refine
      • Like Excel but different
      • It’s an Interactive Data Transformation tool (IDT)
      • What the heck is an Interactive Data Transformation tool?
      • “Interactive Data Transformation tools (IDTs) now allow for quick and inexpensive operations on large amounts of data inside a single integrated interface, even by domain professionals lacking in-depth technical skills. OpenRefine is such an IDT; a tool for visualizing and manipulating data.” (Verborgh and De Wilde, 6)
  5. How do other tools compare to OpenRefine?
      • Other Tools
          • Spreadsheets
              • Cells are the units of interaction
              • Used for entering data and performing calculation
          • Scripting
              • You only see input and output data
              • Used for transforming data
          • Databases
              • Requires effort to design schemas
              • Data is hidden, unless programming is done to expose views
      • OpenRefine
          • Columns are the primary units of interaction
          • Used for exploring and transforming data
          • You see the data interactively at each step, as you edit and transform it
          • Used for understanding, then transforming data
          • No schema required, like spreadsheets
          • Data is always visible (Huynh, 2)
  6. What can OpenRefine Do?
      • Import and export data
      • Facet data
      • Use regular expressions to match and transform data
      • Cluster together similar data
      • Reconcile data to outside data sets
  7. Functions
      1. Importing/Exporting
      2. Faceting
      3. Transforming
      4. Clustering
      5. Reconciling
  8. Use Cases
      1. Analyze e-Library search logs
      2. Clean Controlled Vocabulary
      3. Match or Link Data
      4. Export Data for Reuse
  9. Use Case: Analyze e-Library search logs
      • Normalize funky data
      • See general trends
  10. The basic process:
      1. Download logs from server
      2. Transform data in OpenRefine
      3. Analyze data in Excel or another statistical analysis tool
      • Tools pictured: FileZilla Client, OpenRefine, Excel
  11. Import Options and Preview
  12. Query Results
  13. Basic Faceting
      • Choose “text facet”
      • Values:
          1. ATSQuery
          2. Login
          3. UFSQuery
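
A text facet, as on slide 13, is essentially a count of each distinct value in a column. The following is a minimal Python sketch of that idea, not OpenRefine's own code. The sample list is made up for illustration; only the three value names (ATSQuery, Login, UFSQuery) come from the slide.

```python
from collections import Counter

# Hypothetical sample of the faceted column; only the three value names
# (ATSQuery, Login, UFSQuery) are taken from the slide.
query_types = ["ATSQuery", "Login", "UFSQuery", "Login", "ATSQuery", "Login"]


def text_facet(values):
    """Return (value, count) pairs, most frequent first, like a facet panel."""
    return Counter(values).most_common()


facet = text_facet(query_types)
for value, count in facet:
    print(f"{value}\t{count}")
```

Sorting by count mirrors the facet panel's "sort by count" view, which makes both dominant values and rare oddballs easy to spot.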
  14. Use Faceting to Remove Data
      • Choose the Login value
      • Get 14,204 matches
      • Remove all matching rows
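
Selecting a facet value and removing all matching rows amounts to filtering the table on one column. A rough Python equivalent is sketched below; the rows, the column names `type` and `query`, and the helper name `remove_matching_rows` are all assumptions made for illustration.

```python
# Hypothetical rows; the real log on the slide had 14,204 Login rows.
rows = [
    {"type": "ATSQuery", "query": "first sample search"},
    {"type": "Login", "query": ""},
    {"type": "UFSQuery", "query": "second sample search"},
    {"type": "Login", "query": ""},
]


def remove_matching_rows(rows, column, value):
    """Drop rows where column == value; return (kept_rows, removed_count)."""
    kept = [row for row in rows if row.get(column) != value]
    return kept, len(rows) - len(kept)


kept, removed = remove_matching_rows(rows, "type", "Login")
print(f"removed {removed} rows, {len(kept)} remain")
```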
  15. Use Regular Expressions to Derive Data
      • Create a new column based on an existing column
      • Use a GREL expression to derive it
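
The slides do not show the GREL expression or the log format, so the following Python sketch only illustrates the general technique of deriving a new column from an existing one with a regular expression. The `index=XX&term=...` layout, the column names, and `derive_index` are invented for this example; only the index codes AU and TI appear in the deck.

```python
import re

# ASSUMPTION: the real log format is not shown on the slides. The query
# strings below are made up; only the codes AU and TI come from the deck.
INDEX_PATTERN = re.compile(r"\bindex=([A-Z]{2})\b")


def derive_index(raw_query):
    """Return the two-letter index code found in raw_query, or None."""
    match = INDEX_PATTERN.search(raw_query)
    return match.group(1) if match else None


rows = [
    {"query": "index=AU&term=sample author"},
    {"query": "index=TI&term=sample title"},
    {"query": "term=no index here"},
]

# Equivalent in spirit to adding a column based on the "query" column.
for row in rows:
    row["index"] = derive_index(row["query"])

print([row["index"] for row in rows])
```

Rows where the pattern does not match get `None`, which plays the role of a blank cell and shows up as its own group when the new column is faceted.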
  16. Facet Results to Find Errors
      • Take a look at the values in the new column
      • Clean up the few oddballs
  17. Edit Faceted Data
      • Edit the cells individually
      • “Apply to All Identical Cells”
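
“Apply to All Identical Cells” replaces every cell in the column that holds the same value as the one being edited. A minimal Python sketch of that behavior follows; the sample values and the helper name `apply_to_all_identical` are hypothetical.

```python
# Hypothetical column with a lowercase oddball value to clean up.
rows = [{"index": "AU"}, {"index": "au"}, {"index": "TI"}, {"index": "au"}]


def apply_to_all_identical(rows, column, old_value, new_value):
    """Replace every exact match of old_value in column; return the edit count."""
    changed = 0
    for row in rows:
        if row.get(column) == old_value:
            row[column] = new_value
            changed += 1
    return changed


changed = apply_to_all_identical(rows, "index", "au", "AU")
print(f"{changed} cells updated")
```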
  18. Filter AU Terms
      • Make a column called “Search Terms”
      • Filter by AU facet in index column
  19. Cluster Data
      • Cluster: groups similar terms together
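
One common way to group similar terms is key collision: each value is reduced to a normalized key, and values that share a key form a cluster. The Python sketch below is loosely modeled on the fingerprint keying idea (trim, lowercase, strip punctuation, sort unique tokens). It is a simplified illustration, not OpenRefine's implementation, and the sample terms are invented.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Reduce a string to a normalized key: lowercase, ASCII, sorted unique tokens."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group distinct values that share a fingerprint; skip singletons."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical author search terms.
terms = ["Twain, Mark", "twain mark", "Mark Twain", "Faulkner, William"]
clusters = cluster(terms)
print(clusters)
```

Because tokens are sorted before joining, "Twain, Mark" and "Mark Twain" collapse to the same key, while a value with no near-duplicate stays out of the cluster list.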
  20. Results of Derived and Cleaned Data
      • These are the top author searches for the month
  21.
      • Apply Text filter
      • Search AND or AND TI
  22. Use Case: Clean Controlled Vocabulary
      • Facet data in selected fields
      • Assess percentages that conform to your schema
      • Use “edit” and “apply” or “apply to all identical cells”
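
Assessing how much of a field conforms to a controlled vocabulary can be thought of as a simple percentage over the non-blank values. The sketch below assumes a small made-up vocabulary and sample column; the function name `conformance_percent` is also hypothetical.

```python
def conformance_percent(values, vocabulary):
    """Percentage of non-blank values that appear exactly in the vocabulary."""
    non_blank = [value for value in values if value.strip()]
    if not non_blank:
        return 0.0
    conforming = sum(1 for value in non_blank if value in vocabulary)
    return 100.0 * conforming / len(non_blank)


# Hypothetical vocabulary and field values for illustration only.
vocabulary = {"Vocal", "Instrumental", "Vocal and Instrumental"}
values = ["Vocal", "Instrumental", "instrumental", "Piano solo", ""]

percent = conformance_percent(values, vocabulary)
print(f"{percent:.1f}% of non-blank values conform")
```

Matching here is exact and case-sensitive, so variants such as "instrumental" count as non-conforming, which is usually what you want when deciding what still needs cleanup.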
  23. Imported Data
      • Field names
      • Number of rows and records equal
      • Unaltered data
  24. Faceting Data
      • Controlled vocabulary fields are good for faceting
      • Examples
          • Medium
          • Medium Notes
          • Instrumentation
          • Style
          • Key Signature
          • Name fields
  25. Splitting Data
      • Splitting multi-valued cells requires that there be a regularly occurring division between terms
      • CONTENTdm uses “; ”
      • Now rows ≠ records; each record will have one or more rows
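
Splitting multi-valued cells on a regular separator turns one record into one or more rows. The Python sketch below illustrates that with the “; ” separator mentioned on the slide. The records, the column names, and the added `record_id` field are assumptions for this example, and unlike OpenRefine's record mode the sketch repeats the other columns on every row.

```python
def split_multivalued(records, column, separator="; "):
    """Emit one row per separated value, tagged with the source record's position."""
    rows = []
    for record_id, record in enumerate(records):
        for part in record[column].split(separator):
            rows.append({**record, column: part, "record_id": record_id})
    return rows


# Hypothetical records with a multi-valued "medium" field.
records = [
    {"title": "Sample Song A", "medium": "Vocal; Instrumental"},
    {"title": "Sample Song B", "medium": "Instrumental"},
]

rows = split_multivalued(records, "medium")
print(f"{len(records)} records became {len(rows)} rows")
```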
  26. Faceting Results Before and After
      • Facet Before Split
      • Facet After Split
  27. Correct Faceted Metadata Option 1
      • One “Vocal and Instrumental” has an extra space
      • Many extra blanks from other split records
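
The two problems on slide 27, a stray extra space and leftover blank cells, can both be handled by normalizing whitespace and then dropping empty values. A small Python sketch of that cleanup follows; the sample values and the function name `clean_values` are illustrative.

```python
def clean_values(values):
    """Collapse runs of whitespace, trim the ends, and drop blank values."""
    collapsed = [" ".join(value.split()) for value in values]
    return [value for value in collapsed if value]


# Hypothetical facet values after a split, including a trailing-space variant.
values = ["Vocal and Instrumental", "Vocal and Instrumental ", "", "Instrumental", "  "]

cleaned = clean_values(values)
print(cleaned)
```

After cleaning, the two spellings of "Vocal and Instrumental" become identical and would appear as a single facet value.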
  28. Correct Faceted Metadata Option 2
  29. Use Case: Match or Link Data
      • Clustering
      • Add a thesaurus like VIAF
      • “Reconcile” to the thesaurus
      • “Apply” or “apply to all identical cells”
  30. Clustering
  31. Adding a Reconciliation Service
      • Download Extension
      • Copy contents of Zip File into extensions folder
      • Restart the application
  32. Reconciliation Process
  33. Reconcile Results
      • Reconcile with clustered data
      • Search for match
      • Not every service will include a term for each item in a field
          • Skip reconciling a term
          • Reconcile a term with a different service
          • (Create new topic)
  34. Use Case: Export Data for Reuse
      • Split cells or columns to add values
      • Concatenate
      • Export to text files, Excel, or RDF?
  35. Export Options
      • Export project to share a project’s data and history (undo/redo)
      • Simple exports like tab-separated
      • Advanced options:
          • Triple loader
          • MQLWrite
          • Custom tabular exporter…
          • Templating…
          • RDF as RDF/XML
          • RDF as Turtle
  36. Export a Project
      • Export project to share an entire project
      • Download and save
      • Select “Import Project” and browse for saved file
      • Should show you the entire project history under the undo/redo tab
  37. Simplify Complex Records for Export
  38. Combine Data into Single Row Records
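
Combining data back into single-row records is the reverse of splitting: rows that belong to the same record are grouped and their values joined with a separator. The Python sketch below assumes each row carries a key identifying its record; the column names and the helper `join_multivalued` are hypothetical.

```python
def join_multivalued(rows, key_column, value_column, separator="; "):
    """Group rows by key_column and join their values into one row per record."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[key_column], []).append(row[value_column])
    return [
        {key_column: key, value_column: separator.join(parts)}
        for key, parts in grouped.items()
    ]


# Hypothetical split rows, one value per row.
rows = [
    {"title": "Sample Song A", "medium": "Vocal"},
    {"title": "Sample Song A", "medium": "Instrumental"},
    {"title": "Sample Song B", "medium": "Instrumental"},
]

records = join_multivalued(rows, "title", "medium")
print(records)
```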
  39. Export to Simple Data Format
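
A simple export such as tab-separated text can be reproduced outside OpenRefine with Python's standard `csv` module. This sketch writes to an in-memory buffer so it is self-contained; the records and field names are made up for illustration.

```python
import csv
import io


def to_tsv(records, fieldnames):
    """Serialize records as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical single-row records ready for export.
records = [{"title": "Sample Song A", "medium": "Vocal; Instrumental"}]

tsv_text = to_tsv(records, ["title", "medium"])
print(tsv_text, end="")
```

To write a file instead, the same writer can be pointed at a file opened with `newline=""`, which is what the `csv` documentation recommends.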
  40. Conclusion
  41. Recommended Reading
      • Bedoya, Jaclyn. (2014). Analyzing Usage Logs with OpenRefine. Accessed March 16, 2015 at http://acrl.ala.org/techconnect/?p=4253
      • Huynh, David. (2011). Google Refine Tutorial. Accessed March 11, 2015 at http://davidhuynh.net/spaces/nicar2011/tutorial.pdf
      • Verborgh, R., & De Wilde, M. (2013). Using OpenRefine: The Essential OpenRefine Guide That Takes You From Data Analysis and Error Fixing to Linking Your Dataset to the Web. Birmingham, UK: Packt Publishing.
