I’m a former chemistry researcher who was really bad at the data management game
the first time I played it.
Now I’m a data services librarian who has produced a book, a blog, and videos in this
I want to make the data management game easy and understandable to all players.
This presentation will not only show you tools but also provide tips on leveling up
during the game.
Cloud storage is a great option for the 3-2-1 Rule’s offsite copy.
Not all cloud storage is created equal (read Google Drive’s terms of service). And don’t
rely only on cloud storage for your data (several horror stories here).
Many cloud storage providers offer free storage up to a certain amount, and then it’s a
I like SpiderOak. It is primarily a cloud backup solution, so it is less suited to file
sharing (other options are available for that).
It’s billed as “zero knowledge” cloud storage. Files get encrypted on your computer
before being sent to their servers, meaning the company can’t read your files and they
stay secure when travelling across the internet (this is really important).
I combine this with my local computer and an external hard drive to make my 3 copies.
I don’t use Bulk Rename Utility often, but it’s so useful when I do.
Bulk Rename Utility is free for personal users on Windows.
It allows you to rename a large number of files at the same time (such as when you
have a file naming convention you want to apply to existing files).
The interface looks complicated, but that is because it is so powerful.
You can: replace particular characters, add or remove things at a particular position,
easily add numbering or dates, swap parts of the file name around, etc.
It takes a few minutes to learn, but it’s a great tool to have in your back pocket.
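For anyone who prefers a script, here is a minimal Python sketch of the same idea: replace characters, add a prefix, and add numbering across a batch of file names. This is not how Bulk Rename Utility itself works; the function names, the prefix, and the numbering format are all my own illustrative assumptions.

```python
from pathlib import Path


def plan_renames(names, prefix="", start=1, replace=(" ", "_")):
    """Build a list of (old_name, new_name) pairs without touching any files.

    Hypothetical convention: <prefix><cleaned stem>_<3-digit number><extension>.
    """
    old_text, new_text = replace
    plan = []
    for number, name in enumerate(sorted(names), start=start):
        path = Path(name)
        stem = path.stem.replace(old_text, new_text)  # replace particular characters
        plan.append((name, f"{prefix}{stem}_{number:03d}{path.suffix}"))
    return plan


def apply_renames(folder, plan):
    """Carry out a plan inside one folder. Review the plan before calling this."""
    folder = Path(folder)
    for old_name, new_name in plan:
        (folder / old_name).rename(folder / new_name)
```

Calling plan_renames(["run b.csv", "run a.csv"], prefix="2024_") gives the pairs ("run a.csv", "2024_run_a_001.csv") and ("run b.csv", "2024_run_b_002.csv"). Building the plan separately from applying it lets you preview the new names first, much like the preview column in a bulk rename tool.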
Regular expressions (regex) are an amazing tool for search and replace.
Regex doesn’t stand alone, but rather plugs into other tools like Bulk Rename Utility,
Notepad++, Java, etc.
Regex works by pattern matching, allowing you to search for all social security numbers
in a document, reformat any phone numbers, change the order of sections in a
document but keep the text the same, etc.
Regex takes a bit more learning but is incredibly useful for anyone doing text
manipulation or clean up.
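The three tasks mentioned above can be sketched with Python's built-in re module. The patterns below are deliberately simple assumptions (US-style numbers, "Last, First" names) and would need tightening for real data.

```python
import re

# Assumed formats: 123-45-6789 for social security numbers,
# and 10-digit US phone numbers with optional brackets and separators.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE = re.compile(r"\(?(\d{3})\)?[ .-]?(\d{3})[ .-]?(\d{4})")
LAST_FIRST = re.compile(r"\b(\w+), (\w+)\b")


def find_ssns(text):
    """Return every string that looks like a social security number."""
    return SSN.findall(text)


def normalize_phones(text):
    """Rewrite every matched phone number as 555-123-4567."""
    return PHONE.sub(r"\1-\2-\3", text)


def swap_name_order(text):
    """Reorder 'Last, First' to 'First Last' while keeping the text itself."""
    return LAST_FIRST.sub(r"\2 \1", text)
```

For example, normalize_phones("Call (555) 123-4567 or 555.987.6543") returns "Call 555-123-4567 or 555-987-6543", and swap_name_order("Curie, Marie") returns "Marie Curie". The bracketed groups in each pattern capture pieces of the match, and \1, \2, \3 put those pieces back in whatever order you choose.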
The first link on this slide is to a tutorial I like.
The second link is to a tool, RegExr, that allows you to test your written regular
expressions against text.
Versioning files by hand takes up a lot of hard drive space.
A version control system, like Git, tracks what changed between one version and the
next and stores the history compactly, rather than leaving you with a pile of full
hand-made copies. It also streamlines the versioning process.
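To make the idea of tracking what changed between versions concrete, here is a minimal sketch using Python's built-in difflib module. It only illustrates what a difference between two versions looks like; it is not Git, and Git's own storage is more involved than this. The function name and version labels are my own.

```python
import difflib


def show_changes(old_text, new_text, old_label="v1", new_label="v2"):
    """Return a unified diff: only the changed lines plus a little context."""
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    diff = difflib.unified_diff(
        old_lines, new_lines, fromfile=old_label, tofile=new_label
    )
    return "".join(diff)
```

Comparing "sample A\npH 7.0\n" with "sample A\npH 7.2\n" produces a diff in which the line "-pH 7.0" is removed and "+pH 7.2" is added, while the unchanged line appears only as context. Two identical versions produce an empty string.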
Such tools came out of computer science but are being used by many researchers.
Git is free and open source.
Git is different from GitHub: Git handles the version control, while GitHub
hosts the files and versions and can make them available to others.
Git is really useful but has a learning curve. Because of that, I recommend starting with
the GUI version unless you are comfortable with the command line.
Excel is a useful tool but isn’t always the best tool for cleaning data.
It’s especially bad with dates and tends to mangle them.
OpenRefine is a free, open source tool that was previously known as Google Refine.
It is the best tool for cleaning up tabular data.
OpenRefine can break data down by “facet” (variable values or ranges), allowing you to
do quick parsing, counting, or editing.
Editing includes straight replacement, math, basic text manipulation (uppercase to
lowercase, etc.), or other functions using the General Refine Expression Language (GREL).
You can also break multi-component cells apart or combine them into one.
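To show what faceting and cell splitting mean, here is a rough Python sketch of the two operations on a small table held as a list of dictionaries. This is not OpenRefine's code or GREL, just an illustration; the function names and the example columns are assumptions of mine.

```python
from collections import Counter


def text_facet(rows, column):
    """Count how often each distinct value appears in one column."""
    return Counter(row[column] for row in rows)


def split_cell(rows, column, separator, new_columns):
    """Break a multi-component cell into separate named columns."""
    result = []
    for row in rows:
        parts = [part.strip() for part in row[column].split(separator)]
        new_row = {key: value for key, value in row.items() if key != column}
        new_row.update(zip(new_columns, parts))
        result.append(new_row)
    return result
```

With rows whose "location" column holds "Madison, WI" twice and "Chicago, IL" once, text_facet(rows, "location") counts 2 and 1, which is the kind of summary that makes typos and near-duplicates easy to spot. split_cell(rows, "location", ",", ["city", "state"]) then turns "Madison, WI" into a "city" of "Madison" and a "state" of "WI".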
The tool also allows for text clean up, providing a number of different algorithms for