One way of looking at Big Data is this graph showing dataset size on the vertical axis against number of datasets on the horizontal axis. While there are some very large, celebrated datasets produced by satellites, ocean sensors, etc., there’s a very long tail off to the right of smaller, more obscure datasets that cumulatively account for a large portion of Big Data.
There are many more researchers out in the field collecting heterogeneous data, such as species counts obtained by visual sightings.
And there are many more grants supporting this kind of research...
And those grants are usually much smaller in terms of dollar amounts.
As a result, the large, celebrated datasets tend to come with staff positions for data management, as well as well-supported, standardized software tools. So for a huge number of grants and datasets, especially in Earth, environmental, and ecological sciences, ...
... there’s an ugly truth. Many of these researchers: are not taught data management; don’t know what metadata are; can’t name data centers or repositories; don’t share data publicly or store it in an archive; aren’t convinced they should share data. This amounts to a whole lot of inertia that hasn’t yielded much after years of effort to teach, exhort, cajole, or even scold them about the tools and database systems they should be using ...
... so instead, how about meeting them where they already are, and trying to move forward from there? As it happens, many of them collect data in spreadsheets using Microsoft Excel software. Although it wasn’t designed for research data management, Excel is nonetheless ubiquitous and familiar. Trouble arises because some uses can make data re-use harder.
This led to a vision of extending Excel to make it more friendly for research data. So CDL partnered with MS Research, GBMF, and the DataONE community to gather requirements and build this extension.
The project was to become a tool called DataUp, designed to facilitate Archiving, Sharing, and Publishing. This was to be achieved by enabling and supporting data management and organization. The result would enable and support data re-use and reproducibility.
That was DataUp – the vision. This is DataUp – the implementation, and of course, the logo. It consists of an open source tool available on Bitbucket, plus a community of Earth, environmental, and ecological researchers whose needs were the focus of the first release, plus the DataUp client, which consists of an add-in and a web application. What are these last two things? Good question.
What does it look like? Here’s the add-in. Note the wide ribbon across the top showing extra buttons supported by DataUp. Click on the Best Practices check and it will produce an error report like the one on the right.
Blowing up that report a bit, it shows you some of the errors it found: embedded comments, tables, pictures, color coding, non-contiguous data, missing data, etc. Some of these it can offer to correct by removing them. Others it can only point out.
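To make two of those checks concrete, here is a minimal sketch of how missing values and non-contiguous data might be detected in a grid of cells. It is an illustration only, not DataUp's actual code: the function name `check_best_practices`, the list-of-rows representation, and the wording of the messages are all assumptions made for this example.

```python
from typing import List, Optional

Cell = Optional[str]


def check_best_practices(rows: List[List[Cell]]) -> List[str]:
    """Return warnings for a grid of cell values (hypothetical helper, not DataUp's API)."""
    problems: List[str] = []
    if not rows:
        return problems

    # Pad ragged rows so every row has the same number of columns.
    width = max(len(row) for row in rows)
    grid = [list(row) + [None] * (width - len(row)) for row in rows]

    def is_blank(value: Cell) -> bool:
        return value is None or str(value).strip() == ""

    # Only look between the first and last rows that actually hold data.
    data_rows = [i for i, row in enumerate(grid) if not all(is_blank(v) for v in row)]
    if not data_rows:
        return problems
    first, last = data_rows[0], data_rows[-1]

    for i in range(first, last + 1):
        row = grid[i]
        if all(is_blank(v) for v in row):
            # A fully blank row inside the data block splits it in two.
            problems.append(f"row {i + 1}: blank row splits the data (non-contiguous)")
            continue
        for j, value in enumerate(row):
            if is_blank(value):
                problems.append(f"row {i + 1}, column {j + 1}: missing value")
    return problems


sheet = [
    ["site", "species", "count"],
    ["A", "heron", "3"],
    [None, None, None],
    ["B", "egret", ""],
]
for warning in check_best_practices(sheet):
    print(warning)
```

Like the report in the add-in, this sketch only points problems out. Whether a blank row can safely be removed or a missing value filled in is left to the researcher.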
Then there’s the Generate Metadata action, which adds a new tab on the bottom – the red arrow is pointing to its label. This feature supports general description of the entire dataset. The inset on the right shows the web app equivalent. In it you can see red stars indicating the few fields that are required. This first implementation of DataUp supports only EML, the Ecological Metadata Language.
For column (attribute) metadata, in the add-in you can select the set of columns that you want to describe. Again, right now only EML is supported.
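As a rough picture of what column-level metadata can look like once it is written out, the sketch below builds a simplified EML-style `attributeList` from a few column descriptions. It is an assumption-laden illustration, not DataUp's output: the helper `attribute_list` and the example columns are made up, and the fragment leaves out elements (such as the measurement scale) that a schema-valid EML document would need.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple


def attribute_list(columns: Iterable[Tuple[str, str]]) -> ET.Element:
    """Build a simplified, EML-style attributeList (hypothetical helper)."""
    root = ET.Element("attributeList")
    for name, definition in columns:
        attribute = ET.SubElement(root, "attribute")
        # One name and one plain-language definition per selected column.
        ET.SubElement(attribute, "attributeName").text = name
        ET.SubElement(attribute, "attributeDefinition").text = definition
    return root


# Example columns are invented for illustration.
described = attribute_list([
    ("species", "Common name of the species sighted"),
    ("count", "Number of individuals seen during the visit"),
])
print(ET.tostring(described, encoding="unicode"))
```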
An important step is Creation of the Data Citation, which is the primary mechanism to enable discovery and to give credit for producing the data. An important part of that is when you click on “retrieve an identifier”...
You are then asked to select the repository where you’ll deposit the data. This first version of DataUp knows about only one repository, CDL’s Merritt-based ONEShare archive. After selecting it you’ll see the identifier, in this case an ARK. Behind the scenes, DataUp asks the repository (CDL’s Merritt in this case) for an identifier, and the repository in turn asks CDL’s EZID service for it.
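The chain of requests can be sketched with in-memory stand-ins. Nothing below talks to Merritt or EZID, and none of it reflects their real interfaces: the class names, method names, and the placeholder ARK prefix are assumptions chosen only to show the delegation from client to repository to identifier service.

```python
import itertools


class FakeIdentifierService:
    """Stand-in for an identifier service such as EZID (hypothetical interface)."""

    def __init__(self, prefix: str = "ark:/99999/fk4") -> None:
        # Placeholder ARK prefix for illustration, not a real assignment.
        self._prefix = prefix
        self._counter = itertools.count(1)

    def mint(self) -> str:
        return f"{self._prefix}{next(self._counter):06d}"


class FakeRepository:
    """Stand-in for a repository such as ONEShare (hypothetical interface)."""

    def __init__(self, name: str, id_service: FakeIdentifierService) -> None:
        self.name = name
        self._id_service = id_service

    def request_identifier(self) -> str:
        # The repository does not mint identifiers itself; it delegates.
        return self._id_service.mint()


def retrieve_identifier(repository: FakeRepository) -> str:
    """What the client does when the user asks for an identifier."""
    return repository.request_identifier()


practice_repo = FakeRepository("ONEShare (practice)", FakeIdentifierService())
print(retrieve_identifier(practice_repo))
```

The point of the indirection is that the client only needs to know about the repository; which identifier service sits behind it is the repository's concern.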
Finally you click on Post, and you’ll get an email when your dataset has been registered in the repository. There is actually a practice version of ONEShare that you can select – we strongly recommend that you use that when testing DataUp, as it helps keep the quality of the main repository high.
Among all the many researchers receiving those smaller grants I talked about earlier – the long tail researchers – there are those who have or can afford a relationship with an existing data center repository. But within that long tail there’s a long tail of researchers who are unaffiliated with a data center or can’t afford one, and we wanted them to be able to use DataUp and deposit in an archive immediately. The long tail of the long tail includes, e.g., citizen scientists.
What does the future hold? We plan to build a strong community of users and developers, add more repository types, and add support for additional metadata schemas. There’s no reason that DataUp couldn’t be extended to any domain that uses spreadsheet data. A high priority is to support more DataONE repositories, which will give DataUp visibility and leverage from the entire DataONE federation, and from there take advantage of connections to the entire NSF DataNet program and global initiatives such as INTEROP.
Transcript of "Big Data's Long Tail"
Big Data’s Long Tail. Carly Strasser, John Kunze, Trisha Cruse. University of California Curation Center, California Digital Library. 10 December 2012. From Flickr by rahenz.