Slide 1 - Intro
Good afternoon everyone, and thank you for joining me today. My name is Rebekah Cummings, and today I’m going to talk about data curation as an opportunity for reinvention within academic libraries.
Last year, when I began exploring this topic for the first time, I had a chance to work with two different research teams to learn about their data practices and help them get some of their data into a suitable archive.
The first group I worked with was the Digital Youth Network, which had just finished a $50 million project developing after-school programs for kids. The team at UCLA, however, had absolutely no data for me to look at or work with. All the data were being held on a private server by our collaborator’s co-P.I. at Stanford and were completely inaccessible.
The second group we worked with was the Civil Rights Project at UCLA. They had lots of data, but the data were scattered across a bunch of different places, held by individual researchers on their personal laptops. It took us 10 weeks, but we were finally able to secure three datasets from three different researchers. These are not the worst data management stories you could hear, by a long shot; they’re just my personal experiences from my first year helping researchers curate their data.
Today we’re going to talk about data management problems in academic research and how data curation can help solve those problems through a mix of technical and social solutions. To demonstrate what this may look like in the real world, I’m going to discuss my own experiences working with the UCLA Civil Rights Project as a model for how librarians can improve data management in research teams through education and outreach. But first, let’s back up a bit to talk about the problem…
As a society, we spend a ton of money on research. The NSF alone has a budget of roughly $7 billion. Most of this money goes toward research done by small teams investigating a particular question. This is not the problem; this is a great thing. It shows that as a society we care enough about progress, understanding, and innovation to invest in research. The problem is that most of the data collected by research teams stays with the team that collected it. It’s generally not shared with other researchers, it’s not archived in a repository, and it’s not submitted to the granting agency or university that funded the research.
We do have outputs for research, and they exist in the form of publications. But not everything about the research can be found in publications. Publications show findings, methods, and distilled or condensed data, which are not the same thing as the underlying data. What people, and specifically funding agencies, are starting to realize is that data are really, really valuable.
Not only do openly available data provide better transparency, but access to those data allows researchers to reanalyze them, ask new questions, and combine them with other datasets. When researchers deposit their data into a centralized repository with similar data, we get a corpus of data…
One example of this is Microsoft’s WorldWide Telescope. By stitching together data from the best ground-based and space-based telescopes in the world and combining them with 3-D navigation, this tool allows anyone to explore the universe from their desktop. Aggregations of data, like the WWT, allow us to use data in ways that were previously unimaginable and expose patterns that are impossible to spot without a critical mass of data. These are all great reasons to share research data, but the reality is that, except for a few fields like genomics and astronomy, not much data are being shared.
A recent study done at ICPSR shows that only a fraction of social science data from studies funded by the NSF and NIH are archived. [Refer to graph] And this is in the social sciences, where ICPSR is fairly well known; in ecology the figure is less than 1%. In response, many agencies, including the NSF, are now requiring data management plans to be included with every research proposal, in the hopes that better data management will eventually lead to more data being archived, shared, and reused.
By now, you may be wondering how librarians fit into this equation. Librarians have become the first line of defense in helping academic researchers manage their data. This makes sense. Academic librarians already exist in universities as a resource for information management. We have trust capital from being part of dependable institutions, and historically, academic librarians have always been in charge of managing and disseminating the scholarly output of their institutions. And now, data are being seen as a valuable scholarly output alongside books, journals, and other publications.
Which leads us to an obvious question: what is data curation? Data curation is the maintenance and preservation of digital research data throughout their lifecycle. Data curation adds value to data by making them discoverable, accessible, and understandable. Data curators help research teams document their data correctly and archive them in a trusted repository, where they can be indexed and accessed with appropriate user permissions attached.
This is becoming necessary as part of the modern academic infrastructure because, through no fault of their own, most scientists:
- aren’t taught data management as part of their curriculum
- don’t know how to create proper metadata
- aren’t familiar with discipline-specific repositories
- aren’t convinced they really have to share their data
Most researchers are willing to share under certain conditions, such as first rights to publish and attribution. But they don’t know how to preserve their data in a way that makes it reusable for other people, and quite frankly, the funding agencies requiring data management plans haven’t given them a lot of guidance.
Before we move on to the social solutions, I think it’s wise to take a moment to touch on the technical solutions that are evolving or already in place. Over the past ten years, technical solutions have developed to enable data sharing, data citation, metadata creation, data indexing, and data archiving. We all know that information management of any kind is not possible without appropriate infrastructure, but data curation in particular has many technical challenges. Some challenges are similar to preserving other types of digital objects (digital obsolescence, levels of granularity). Some challenges are different. For instance, one of the hallmarks of data is that data aren’t self-describing. Without rich description, data often look like strings of numbers on a spreadsheet. Another challenge is that data vary greatly in size and complexity, from a single Excel sheet to the output of a supercomputer. Today I’m focusing on social solutions, but it’s important to note that these technical standards and solutions are still evolving, and librarians have been and should be a part of that conversation.
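To make the "data aren't self-describing" point concrete, here is a minimal sketch in Python. The variable names and labels are hypothetical, invented for illustration; the idea is simply that a codebook turns anonymous strings of numbers into records a stranger can interpret.

```python
# Without description, these rows are just strings of numbers on a spreadsheet.
raw_rows = [
    [1, 2, 34],
    [2, 1, 29],
]

# A codebook (data dictionary) supplies the missing description.
# These variable names and labels are hypothetical examples.
codebook = {
    "student_id": "Anonymized student identifier",
    "school_type": "1 = public, 2 = charter",
    "pct_minority": "Percent minority enrollment, 0-100",
}

def describe(rows, codebook):
    """Attach variable names from the codebook so each row explains itself."""
    names = list(codebook)
    return [dict(zip(names, row)) for row in rows]

records = describe(raw_rows, codebook)
print(records[0])  # now a labeled record, not a bare list of numbers
```

This is the same information a DDI codebook or similar metadata standard captures formally; the sketch just shows why the description has to travel with the numbers.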
In addition to the technical problems, there are social problems within science that prohibit the sharing of data. Scientists aren’t used to sharing their data, except on request from trusted colleagues. This slide shows some of the excuses I’ve heard: the laptop crashed, the data are too complicated, my data are mine… In my short time, I’ve found that the best way to combat these social challenges is through education and outreach, changing the culture of science from within research teams.
In the beginning of this presentation, I mentioned working with the Civil Rights Project as part of a class assignment. The Civil Rights Project is a research group at UCLA that studies racial injustice at a regional and national level. The CRP has commissioned over 400 studies, published 14 books, and been cited in Supreme Court cases on affirmative action. Within my 10-week data course, my team learned a lot about the data management problems in the Civil Rights Project, and we did manage to get three of their datasets into an archive. But at the end of 10 weeks, there was still much more to do. So after the class, I contacted Libbie Stephenson in the SSDA and Gary Orfield, the director of the Civil Rights Project, and asked if they would allow me to do a 240-hour internship to manage the digital assets of the Civil Rights Project.
My team had pinpointed three data management problems within the Civil Rights Project that I chose to focus on for my internship. #1 - The CRP had no digital preservation plan. Some of their data were on the GSEIS server, but most stayed with the researchers who had collected them. Also, the CRP did not keep any of their publications in a digital repository. They self-published their publications on their website without structured metadata or a backup plan in case the website crashed. #2 - The CRP had no centralized repository. If someone called looking for data, Laurie Russman, the CRP project coordinator, would have to chase down the researcher who collected the data and get it from them. This was particularly problematic one time when a researcher was out on maternity leave. #3 - The CRP had no governance plans or workflows for keeping their data organized and archived in the long term. I wanted to help solve these problems by providing the appropriate infrastructure to organize, archive, and share their data in a web-based centralized repository that anyone on the team could access. I also wanted to create a workflow and governance plan so that after I graduated, the data practices wouldn’t go back to business as usual.
My first step toward managing the digital assets of the CRP was to continue with the Dataverse that my team had set up as part of the CRP data management strategy. Dataverse is a web-based application run by Harvard to share, analyze, and preserve research data. Dataverse was specifically built for social science research data, which worked perfectly for the Civil Rights Project. It uses the DDI metadata standard, an XML-based schema that can be harvested by data citation indexes like the one Thomson Reuters launched in 2012. Dataverse also provides version control for datasets, persistent identifiers, and recommended data citations to promote best practices in data management. You can even run simple analyses directly on the site, so if someone wanted to verify the research, they could do it right there. Most importantly for the CRP, Dataverse allowed them to set permissions for who can actually view the CRP data. This way, if someone calls looking for data, Laurie can simply provide them with a password to the Dataverse site rather than track down the researcher.
The second part of my digital preservation strategy for the CRP was to create an eScholarship site for their publications. eScholarship is an open access scholarly publishing service that’s free to the UC community. eScholarship uses Dublin Core metadata, which allows for faceted searching by author, title, or subject and is also harvested by data service providers. It uses preservation strategies like LOCKSS, provides persistent identifiers, and even links back to the Dataverse if the data for a particular publication can be found online.
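For those unfamiliar with Dublin Core, a record is just a handful of simple elements in a standard XML namespace. Here is a minimal sketch in Python of the kind of record a harvester could index; the title, creator, and subject values are hypothetical placeholders, not actual CRP metadata.

```python
import xml.etree.ElementTree as ET

# The Dublin Core element-set namespace (dc:title, dc:creator, etc.).
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

record = ET.Element("record")
# Hypothetical example values, for illustration only.
for field, value in [
    ("title", "Example report on school segregation"),
    ("creator", "Civil Rights Project / Proyecto Derechos Civiles"),
    ("date", "2013"),
    ("subject", "civil rights"),
]:
    el = ET.SubElement(record, f"{{{DC}}}{field}")
    el.text = value

xml = ET.tostring(record, encoding="unicode")
print(xml)
```

Because every repository tags the same handful of fields the same way, a service provider can harvest thousands of such records and offer faceted search by author, title, or subject across all of them.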
The CRP director was also excited to see that eScholarship provides usage statistics, so they can tell how often the publications are viewed or downloaded online. I launched the eScholarship site in late January, and by April the publications on the site had already been viewed or downloaded 784 times. Even at its inception, the eScholarship site is off to a really good start.
The last part of my digital preservation strategy for the Civil Rights Project was to create a simple workflow and governance plan to ensure that the sites would continue to be used after I graduated. First of all, this meant that throughout the process I tried to do a bit of missionary work with the CRP researchers and staff. I appealed to their interests by letting them know that this type of infrastructure can help them secure funding by allowing them to write stellar data management plans, it will make Laurie’s job easier, and it will increase their impact by making their publications more available. Lastly, it just makes their research more transparent. So hopefully that groundwork has been laid. I also created a simple workflow to determine responsibility for submitting the data to Dataverse and the publications to eScholarship, so that the governance aspect is clear. This is the workflow I included in my final report to the CRP.
This was one strategy for changing the culture of data management within a research team. There are others. Data librarians should offer to teach data management in methods classes, emphasizing that major funders now require data management plans as part of research proposals. We can plan a few hours every week to work in the grant-writing offices, helping research teams write their data management plans at the moment they are most ready to listen. We can offer to teach workshops or provide individual data consultations for research teams.

If you remember the other research team I mentioned at the beginning of this presentation, the one that developed after-school programs for kids: the P.I. was so struck by our class project that she later contacted me to come on as a paid consultant for her new research project. I declined the money, but met with them as a consultant as part of my internship. After our consultations, her team was so excited about data management that they organized data management workshops for me and Libbie to teach in the education department.

After only a year of this, I can tell you that the need is there. There is an imperative to help our scholars with their data management. We’ve already lost untold amounts of data, some of which can never be collected again. Funding agencies have levied data management requirements that most researchers are currently unprepared to fulfill. While researchers face this very real challenge, librarians can be assets by educating them about data management and using our skills to help solve this problem within our institutions.
Thank you. Questions?
Librarians as data stewards: a case study of the UCLA Civil Rights Project
Rebekah Cummings
May 10, 2013
University of California, Los Angeles
Most researchers:
- Aren’t taught data management
- Don’t know how to create proper metadata
- Aren’t familiar with discipline-specific repositories
- Aren’t convinced they really have to share their data
Adapted from http://www.slideshare.net/carlystrasser/iassist20120608
Technical Solutions:
- Improved data transfer and storage
- Data citation standards
- Metadata standards
- Indexing services
- Trusted repositories