Who will manage scientific research data?
A study of the emerging astronomy and library data management workforces
UCLA, Department of Information Studies
This study examines how scientists, librarians, and other data managers address the following questions:
What scientific research data should be managed?
What is research data management?
What knowledge and expertise are significant to manage research data?
Specifically, how and why are the SDSS and LSST data curated and preserved?
This research is funded by the U.S. National Science Foundation (“Data Conservancy” OCI0830976, S. Choudhury, PI, Johns Hopkins University, and “Knowledge & Data Transfer: the Formation of a New Workforce” #1145888. C.L. Borgman, PI; S. Traweek,
Co-PI) and the Alfred P. Sloan Foundation (“The Transformation of Knowledge, Culture, and Practice in Data-Driven Science: A Knowledge Infrastructures Perspective” #20113194. C.L. Borgman, PI; S. Traweek, Co-PI).
Thank you to the members of the UCLA Knowledge Infrastructures Team: Christine L. Borgman, Peter T. Darch, Sharon Traweek, and Jillian C. Wallis. http://knowledgeinfrastructures.gseis.ucla.edu/ @UCLA_KI
SDSS Image: This is the globular cluster Palomar 5, which is a cluster of stars orbiting the Milky Way at a distance of 210 thousand light years. Most of the fainter stars in the picture belong to the cluster; the brighter stars are foreground stars elsewhere in the
Milky Way. http://www.sdss.org
This study builds on existing interviews, ethnographic participant observation, and document
analysis conducted since Fall 2011. Seven weeks of participant observation and 35 interviews
with key individuals in the SDSS collaboration have already been performed. Future work will be
conducted with LSST collaboration members.
Semi-structured interviews are the primary form of data collection. Interviews include questions
revised from existing team protocols that engage researchers on their data practices,
understanding of data management, archival, and preservation activities, and how they perceive
their data management knowledge and expertise needs.
Participant observation complements interviews to examine whether there are important distinctions
between what interviewees express formally and their observed daily practices. The collected
transcripts, field notes, and documents are analyzed to determine how and why practices differ between
and amongst communities.
Examination of the SDSS data transfer has revealed difficulties involved in large-scale data stewardship and the
importance of domain knowledge as librarians and others manage scientific data.
Findings show that what it means to curate and preserve the SDSS data differs between communities. The two libraries
charged with managing the dataset went about the task differently. This study shows that the definition of the dataset
and what it means to curate the data are not agreed upon within or amongst the workforces.
The two libraries tasked with caring for the SDSS data were made up of distinct staff members, each with individual
educational and experiential backgrounds. For example, many of the first library’s staff members hold professional
librarian master’s degrees. On the other hand, none of the staff charged with managing the data at the second library
hold the degree, and instead have more experience in information technology and software development.
The education and past work experiences of the staff at the two libraries led to distinct modes of operation in
how each prioritized care of the SDSS data. The two distinct workforces brought about divergent ways of
managing the dataset.
A large number of people, with diverse educations and work experiences, are involved in SDSS data management:
domain scientists, computer scientists, software and systems engineers, programmers, librarians, and archivists. The
SDSS data management does not fall under the purview of any one kind of expertise. Instead, a variety of backgrounds
and educational histories emerge from the workforce.
The SDSS case study demonstrates that effective data management encompasses multiple tasks, requires composite
types of expertise, and emerges from multiple workforces.
The Sloan Digital Sky Survey
The Sloan Digital Sky Survey (SDSS) is one of the most ground-breaking surveys in the history of astronomy. The
survey covered over a quarter of the night sky with high-quality optical and spectroscopic imaging. The first phase
of the SDSS project (SDSS-I) ran from 2000-2005, the second (SDSS-II) from 2005-2008, and subsequent related
projects continue today. The SDSS data are openly available to astronomers and the general public through data
releases. The SDSS-I/II collection constitutes 130-180 terabytes of astronomical observations.
After data collection finished, four formal agreements (Memoranda of Understanding) established how the
collection would be cared for over the next five years, concluding January 2014. The data were transferred from a
national laboratory to two different university libraries; they were moved from one kind of workforce to two others.
Research Sites and Population
This project investigates astronomy data practices in three communities:
The Sloan Digital Sky Survey (SDSS) collaboration,
The Large Synoptic Survey Telescope (LSST) collaboration, and
The library and archive workforces partnered with these and other astronomy collaborations.
The study population includes astronomy faculty, students, and staff; library and archive staff;
computer science staff; software engineers and programmers; and administrators from the
nationally distributed institutions involved in the SDSS and LSST.