A metadata scheme of the software-data relationship: A proposal
Kai Li
College of Computing and Informatics, Drexel University
kl696@drexel.edu
Introduction
Research data and scientific software are two important objects in nearly every knowledge domain under the data-driven paradigm of scientific research. Despite the increasing attention they receive, their relationship in real-life research contexts has yet to be explored in enough depth for us to understand and describe how data flows through the research lifecycle and how these two types of objects contribute to knowledge production, individually and collectively. A better understanding of this relationship will significantly promote the reproducibility of scientific studies and the reusability of these research objects.
This poster proposes a project that aims to describe the relationship between data and software in the context of scholarly communication.
Conceptual backgrounds
Putting data back into its contexts is an important perspective for understanding how data contributes to knowledge production. From this perspective, as Geoffrey Bowker put it, "raw data is both an oxymoron and a bad idea" (Bowker, 2005). As important as data is in research activities today, it is never a singular end product that can be unquestionably equated with knowledge and facts: data has no value or meaning if it is isolated from other objects, such as people, practices, technologies, and other material objects and relationships (Borgman, 2015).
Scientific software is an important object in this list that interacts deeply with data: in the era of digital data, no processing or calculation of data is possible without software or algorithms. But the relationship is far more entangled than this obvious fact suggests:
• First, despite their "usual" different roles and characteristics in research lifecycles (Katz et al., 2016), software is data in nature, and there are strong benefits when software is curated and preserved like datasets (Lynch, 2014; Marcus & Menzies, 2010).
• Second, to make things even more complicated, a growing number of software packages are datasets per se or contain datasets.
• Last but not least, simulation data, that is, data generated through the implementation of models, has become an important type of research data in contemporary scientific practice.
All these points suggest a mutually reliant relationship between data, software, and research methods in theory, and highly diverse possibilities in practice. Edwards (2010) discussed how the very process of collecting data inevitably involves assumptions about what the datasets should look like, as well as activities to validate those assumptions; as he put it, “virtually everything we call ‘global data’ is not simply collected; it is checked, filtered, interpreted, and integrated by computer models” (p. 188). Based on this notion, we place research methods in the same sphere as data and software and examine how software and research methods are involved in every stage of the lifecycles of research data and research activities. This conceptual framework is illustrated in Figure 1.
Figure 1: conceptual map (data, software, and method situated within the data lifecycle and the research lifecycle)
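To make this framework slightly more concrete, the sketch below shows one hypothetical way to record, for each stage of the data lifecycle, the software and methods involved. The stage names, class names, and example values are illustrative assumptions only; they are not part of the scheme this project proposes.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Software:
    name: str
    version: str = ""

@dataclass
class Method:
    name: str

@dataclass
class LifecycleStage:
    # One stage of the data/research lifecycle (e.g., "collection", "analysis").
    stage: str
    software: List[Software] = field(default_factory=list)
    methods: List[Method] = field(default_factory=list)

# A dataset's provenance as a sequence of stages, each linking the data to the
# software and methods involved at that stage (all example values are made up).
provenance = [
    LifecycleStage("collection", [Software("web-scraper", "0.3")], [Method("purposive sampling")]),
    LifecycleStage("analysis", [Software("R", "3.4.1")], [Method("negative binomial regression")]),
]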
Conceptual backgrounds (continued)
Software and method information connected to research data can contribute to two areas of scholarly communication:
• It increases the reproducibility of scientific studies by offering more details about how the experiments were conducted (Gil et al., 2007, p. 25);
• It promotes the reusability of data objects by helping scientists understand these objects in scientific terms (Faniel & Jacobsen, 2010; Van House, Butler, & Schiff, 1998; Zimmerman, 2008).
Despite their significance, software and research methods (and even datasets per se) are often invisible and taken for granted, as infrastructure always is (Star & Ruhleder, 1994). Examples of this invisibility in scholarly communication include the inconsistent and inadequate mention and citation of these digital objects in scholarly products. In light of this challenge, this study takes the approach of "infrastructural inversion" (Bowker & Star, 2000) to expose how the research data infrastructure works, at least for the parts that involve software and research methods.
Plan
The plan of this project is to develop a scheme of relationships between data, software, and research methods in the context of scholarly communication, within the broad area of computational social science. The project will rely on a mixture of data sources, as listed in Table 1, although during its first step only research papers and data papers will be used to pursue the research questions.
An ontology describing the relationship between data, software, and research methods will be presented at the next RDA Plenary.
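As a purely illustrative sketch of what instances of such a scheme could look like, the snippet below expresses the data, software, and method relationships as RDF triples using the rdflib Python library. Every class and property name in the ex: namespace (ex:Dataset, ex:processedWith, ex:appliedMethod, ex:implementedIn) is a hypothetical placeholder, not part of the proposed ontology or of any existing vocabulary.

from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF, DCTERMS

# Hypothetical namespace for the proposed scheme; not an existing vocabulary.
EX = Namespace("http://example.org/data-software-scheme#")

g = Graph()
g.bind("ex", EX)
g.bind("dcterms", DCTERMS)

dataset = URIRef("http://example.org/dataset/survey-extract")
software = URIRef("http://example.org/software/pandas")
method = URIRef("http://example.org/method/multiple-imputation")

# Type the three objects with placeholder classes.
g.add((dataset, RDF.type, EX.Dataset))
g.add((software, RDF.type, EX.Software))
g.add((method, RDF.type, EX.Method))
g.add((dataset, DCTERMS.title, Literal("Survey analysis extract")))

# Placeholder properties linking data, software, and method.
g.add((dataset, EX.processedWith, software))
g.add((dataset, EX.appliedMethod, method))
g.add((method, EX.implementedIn, software))

print(g.serialize(format="turtle"))

Whether the final scheme is expressed in RDF/OWL or in another metadata format is, of course, one of the questions the project itself will need to settle.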
Data source: Research papers (the methods and materials section)
Advantages:
• Offers a direct and peer-reviewed description of how data and software are involved in the studies.
• A high volume of data is available.
Disadvantages:
• Rather than a complete description of the topic, at least part of this section’s aim is to fortify the paper (Latour, 1987).
• It is subject to community norms of how a paper should be written.

Data source: Data papers
Advantages:
• Aims at describing how software and methods helped to create the dataset.
• Also peer reviewed, which makes it a relatively reliable source.
Disadvantages:
• Also subject to the purposes of the “inscription device” (Latour, 1987) and to scientific norms, even though these issues are less discussed for data papers.

Data source: Descriptions of software packages by their developers
Advantages:
• Normally created by the developer of the software, and thus a first-hand description of the package.
Disadvantages:
• Only describes how the software is supposed to be used, rather than how it is actually used in real-life contexts.

Data source: Interviews with scientists as users
Advantages:
• Offers the most direct and user-centered account of researchers’ experience.
Disadvantages:
• The collection and analysis processes are labor intensive and time-consuming.

Table 1: Data sources that will be used in this project

REFERENCES
• Borgman, C. L. (2015). Big data, little data, no data: Scholarship in the networked world. MIT Press.
• Bowker, G. C. (2005). Memory practices in the sciences. Cambridge, MA: MIT Press.
• Bowker, G. C., & Star, S. L. (2000). Sorting things out: Classification and its consequences. MIT Press.
• Edwards, P. N. (2010). A vast machine: Computer models, climate data, and the politics of global warming. MIT Press.
• Faniel, I. M., & Jacobsen, T. E. (2010). Reusing scientific data: How earthquake engineering researchers assess the reusability of colleagues’ data. Computer Supported Cooperative Work (CSCW), 19(3–4), 355–375.
• Gil, Y., Deelman, E., Ellisman, M., Fahringer, T., Fox, G., Gannon, D., … Myers, J. (2007). Examining the challenges of scientific workflows. Computer, 40(12).
• Katz, D. S., Niemeyer, K. E., Smith, A. M., Anderson, W. L., Boettiger, C., Hinsen, K., … others. (2016). Software vs. data in the context of citation. PeerJ Preprints, 4, e2630v1.
• Latour, B. (1987). Science in action: How to follow scientists and engineers through society. Harvard University Press.
• Lynch, C. (2014). The next generation of challenges in the curation of scholarly data. In Research data management: Practical strategies for information professionals (pp. 395–408). West Lafayette, IN: Purdue University Press.
• Marcus, A., & Menzies, T. (2010). Software is data too. In Proceedings of the FSE/SDP Workshop on Future of Software Engineering Research (pp. 229–232).
• Star, S. L., & Ruhleder, K. (1994). Steps towards an ecology of infrastructure: Complex problems in design and access for large-scale collaborative systems. In Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work (pp. 253–264). New York, NY: ACM.
• Van House, N. A., Butler, M. H., & Schiff, L. R. (1998). Cooperative knowledge work and practices of trust: Sharing environmental planning data sets. In Proceedings of the 1998 ACM Conference on Computer Supported Cooperative Work (pp. 335–343).
• Zimmerman, A. S. (2008). New knowledge from old data: The role of standards in the sharing and reuse of ecological data. Science, Technology, & Human Values, 33(5), 631–652.