Beautiful Research Data (Structured Data and Open Refine)

http://serai.utsc.utoronto.ca/rrsi2014

"Unlike traditional academic conferences, the Roots & Routes Summer Institute features a combination of informal presentations, seminar-style discussions of shared materials, hands-on workshops on a variety of digital tools, and small-group project development sessions. The institute welcomes participants from a range of disciplines with an interest in engaging with digital scholarship; technical experience is not a requirement. Graduate students (MA and PhD), postdoctoral fellows and faculty are all encouraged to apply."

Slide notes
  • Take Dragomans File and load it into
  • padani example
  • Difficult to install on Windows?
  • Here I have launched OpenRefine in my browser. The sample file I’m using is located at the URL on the slide. Remember that a longer version of this tutorial is available at http://programminghistorian.org/lessons/cleaning-data-with-openrefine

Transcript

  • 1. Beautiful Research Data Kirsta Stapelfeldt, Coordinator, UTSC Library’s Digital Scholarship Unit
  • 2. In this presentation ● Part One: preparing to create machine-readable data at the outset of a research endeavour ● Part Two: working with “messy” datasets
  • 3. Benefits of machine-readable data ● Easier to query for new insights ● Easier to mount in a computing environment ● Easier to share with others
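To make the “easier to query” point concrete, here is a minimal Python sketch; the file name (interviews.csv) and its columns are hypothetical and not part of the original deck.

    import csv

    # Load a hypothetical structured file of research observations.
    with open("interviews.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    # A question that would mean hours of re-reading in a Word document
    # becomes a one-liner against machine-readable data.
    born_before_1900 = [r for r in rows if r["birth_year"] and int(r["birth_year"]) < 1900]
    print(len(born_before_1900), "people born before 1900")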
  • 4. Just a .csv + Fusion Tables ● Fusion Tables is an experimental, web-based Chrome app ● Took a spreadsheet that Natalie has been working on and loaded it into the app ● Results have not been massaged at all ● We can expect additional benefits from having structured data in the future
  • 5. Part one In which you have no research data...yet
  • 6. Best Case Scenario: You start by applying some best practices. Four pieces of low-hanging fruit...
  • 7. 1. No Word documents ● use a database (or even a spreadsheet), not .doc files ● avoid heavy style information in your research documents (such as bolding and italicizing text, or moving things to other areas of the page using the tab key or spacebar) ● Why?
  • 8. Look beyond the surface. &nbsp; &nbsp; &nbsp; &nbsp; no thank you! http://www.bartleby.com/103/33.html
  • 9. Beauty is more than browser deep http://www.gutenberg.org/ebooks/18827
  • 10. 2. Use consistent formats for elements such as date & language ● e.g. dates recorded consistently where possible (05/25/2014)
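As a sketch of what consistent dates can look like in practice, the snippet below normalizes a few common formats to ISO 8601 (YYYY-MM-DD) with Python's standard library; the list of input formats is an assumption you would adapt to your own sources.

    from datetime import datetime

    # Formats you expect to encounter (an assumption; extend as needed).
    FORMATS = ["%m/%d/%Y", "%d %B %Y", "%Y-%m-%d"]

    def to_iso(date_string):
        """Return the date as YYYY-MM-DD, or None if no known format matches."""
        for fmt in FORMATS:
            try:
                return datetime.strptime(date_string.strip(), fmt).date().isoformat()
            except ValueError:
                continue
        return None

    print(to_iso("05/25/2014"))   # 2014-05-25
    print(to_iso("25 May 2014"))  # 2014-05-25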
  • 11. 3. Taxonomies & Standards ● use controlled vocabularies for keywords, place names, person names of relevance o using an open format for a place name can make geocoding much easier o stay consistent in a given language
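One lightweight way to stay inside a controlled vocabulary is to check new keywords against the agreed list as you go; the vocabulary and sample terms below are made up for illustration.

    # A hypothetical controlled vocabulary for subject keywords.
    CONTROLLED_VOCAB = {"istanbul", "venice", "dragoman", "treaty"}

    def off_vocabulary(keywords):
        """Return any keyword that is not in the agreed vocabulary."""
        return [k for k in keywords if k.strip().lower() not in CONTROLLED_VOCAB]

    print(off_vocabulary(["Istanbul", "Constantinople"]))  # ['Constantinople']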
  • 12. 4. Text Encoding ● Ensure you are using Unicode (UTF-8) ● How do you know? o Notepad can be your friend o Test a sample between systems
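Beyond Notepad, a small script can answer the “how do you know?” question directly: try to decode the file as UTF-8 and fall back to a legacy encoding if that fails. The file name and the cp1252 fallback are assumptions for this sketch.

    # Check whether a file is valid UTF-8; if not, re-save it as UTF-8.
    with open("notes.txt", "rb") as f:
        raw = f.read()

    try:
        raw.decode("utf-8")
        print("Already valid UTF-8")
    except UnicodeDecodeError:
        text = raw.decode("cp1252")  # assumed legacy Windows encoding
        with open("notes-utf8.txt", "w", encoding="utf-8") as out:
            out.write(text)
        print("Re-saved as UTF-8")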
  • 13. http://www.string-functions.com/encodingerror.aspx
  • 14. Changing the way you think about your research process Draw a picture
  • 15. 1. Think small. Atomistic information (what is the smallest meaningful unit of information you are collecting?) For example: ● A person’s name, religion, and DOB ● Mention of a location or name ● Repeated occurrence
  • 16. 2. Connect the dots. What are the relationships between your data elements? Useful tool: The Entity Relationship Diagram
  • 17. Draft Dragomans Content Model
  • 18. Crow’s Foot Notation Exercise - Building an ERD
  • 19. Part two Your data is a mess
  • 20. Tools for dealing with messy data ● Regular Expressions ● Open Refine
  • 21. Regular Expressions: Find & Replace on Steroids ● Available in most productivity suites (iWork, Microsoft Word, LibreOffice/OpenOffice) ● The syntax often differs slightly between tools
  • 22. “The regular expression (?<=\.) {2,}(?=[A-Z]) matches at least two spaces occurring after a period (.) and before an upper-case letter, as highlighted in the text above.”
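The same pattern runs unchanged in Python's re module; the sample sentence below is made up, but it shows the lookbehind and lookahead doing their work.

    import re

    # Two or more spaces that follow a period and precede a capital letter.
    pattern = r"(?<=\.) {2,}(?=[A-Z])"
    text = "First sentence.   Second sentence.  Third."

    # Collapse the extra spaces to a single space.
    print(re.sub(pattern, " ", text))
    # First sentence. Second sentence. Third.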
  • 23. Open Refine ● Similar to spreadsheet software ● Installed on your computer, but used through your browser ● “Power Tool” for messy data The following will draw heavily from this lesson - http://programminghistorian.org/lessons/cleaning-data-with-openrefine (Thanks to Seth van Hooland, Ruben Verborgh, Max De Wilde)
  • 24. Base Assumption of Open Refine ● You have “structured data” ● some consistent and machine-readable logic has been applied to your data o Excel, .csv, XML ● you may have structured data and not know it o Check export options from any software you regularly use
  • 25. 1. Remove duplicates 2. Remove blanks 3. Make data atomistic (smallest meaningful unit) 4. Keep terms/formats consistent
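The same four steps can also be scripted outside OpenRefine. A sketch with pandas follows; the file name (records.csv) and column names (record_id, category, place) are assumptions, not the workshop dataset.

    import pandas as pd

    df = pd.read_csv("records.csv", dtype=str)

    # 1. Remove duplicates, keyed on the identifier that should be unique.
    df = df.drop_duplicates(subset="record_id")

    # 2. Remove blanks: rows whose identifier is missing or whitespace-only.
    df = df[df["record_id"].fillna("").str.strip() != ""]

    # 3. Make data atomistic: one category per row instead of "a|b|c" in one cell.
    df = df.assign(category=df["category"].str.split("|")).explode("category")

    # 4. Keep terms consistent: trim and lower-case place names before comparing.
    df["place"] = df["place"].str.strip().str.lower()

    df.to_csv("records-clean.csv", index=False)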
  • 26. http://data.freeyourmetadata.org/powerhouse-museum/phm-collection.google-refine.tar.gz Choose file & select Next...
  • 27. Set appropriate options and “Create Project”
  • 28. Project is created with 75,814 rows.
  • 29. 1. Look for Blank Records See if any Record IDs are blank by using a numeric facet
  • 30. “Non-numeric” rows are blank.
  • 31. Hovering over the cell makes an “edit” link visible
  • 32. The “blank” fields actually contained a single whitespace character. You can delete the whitespace and then select “Apply to All Identical Cells”.
  • 33. A confirmation message will always show up noting what you’ve done, and giving you a chance to “undo”
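Outside OpenRefine, the same whitespace trap is easy to check for in plain Python; the snippet assumes a hypothetical CSV export of the dataset with a “Record ID” column.

    import csv

    with open("phm-collection.csv", newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    # Identifiers that look blank or hold nothing but whitespace.
    blank_rows = [i for i, r in enumerate(rows, start=2)  # row 1 is the header
                  if not (r.get("Record ID") or "").strip()]
    print("Rows with an empty or whitespace-only Record ID:", blank_rows)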
  • 34. 2. Look for Duplicate Records using Record ID (since it should be unique)
  • 35. Sorting is a visual tool only unless you “Reorder rows permanently”
  • 36. “Blank down” will delete the second instance of a duplicated “Record ID”
  • 37. Then, we can facet the “Record ID” column by blank records.
  • 38. The “true” facet contains all the blank records.
  • 39. Clicking the “true” link will narrow to the blank records, which can then be removed.
  • 40. 3. Make data atomistic “Category” contains numerous categories separated by the “|” character
  • 41. You can tell the system to split the cells using this character.
  • 42. Now only single categories appear.
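For comparison, the same split is one line of Python; the cell value here is invented in the style of the Category column.

    cell = "Numismatics|Medals|Sculpture"  # made-up multi-valued cell
    categories = [c.strip() for c in cell.split("|")]
    print(categories)  # ['Numismatics', 'Medals', 'Sculpture']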
  • 43. 4. Make terms consistent Creating a text facet on “Categories” brings up all the options in this column. We can “cluster” to detect similar terms that vary in spelling or capitalization.
  • 44. This interface allows you to select which term is authoritative. You can then merge terms together.
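The idea behind clustering can be sketched in a few lines: terms that share the same normalized “fingerprint” (lower-cased, punctuation stripped, tokens sorted) are probably variants of one term. OpenRefine's key-collision method is more sophisticated than this, and the sample terms are made up.

    import re
    from collections import defaultdict

    def fingerprint(term):
        """Lower-case, drop punctuation, and sort the unique tokens."""
        tokens = re.sub(r"[^\w\s]", " ", term.lower()).split()
        return " ".join(sorted(set(tokens)))

    terms = ["Glass plate negative", "glass plate negative", "Negative, glass plate"]

    clusters = defaultdict(list)
    for t in terms:
        clusters[fingerprint(t)].append(t)

    for variants in clusters.values():
        if len(variants) > 1:
            print("Possible variants of one term:", variants)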
  • 45. A couple of additional features... The “Undo/Redo” tab lets you step back through your changes, all the way to the creation of the project, if you make a mistake.
  • 46. A “text filter” lets you search within a column (by regular expression, too!)
  • 47. Refine has its own expression language (GREL), with functions that can be used to transform your data.
  • 48. https://github.com/OpenRefine/OpenRefine/wiki/GREL-Functions A full list of these is available on GitHub.
  • 49. Finally, projects can be exported as Refine projects, but also in a number of additional structured formats. Do this frequently.
  • 50. Structured data is beautiful data. Make a plan to create structured data during your research. Clean legacy data, or data you inherit, by becoming a regular expression (regex) expert and/or using a tool like OpenRefine.
  • 51. Go to your library or ITS department to see if you can get support. Thanks for listening to me!