This document presents a conceptual framework called "Agile Data Curation" for improving data management practices. It discusses why data management matters both internally, for the research itself, and externally, for public access and reproducibility. The framework is based on principles adapted from agile software development, focusing on early and continuous data sharing, incremental improvements, and community involvement. The presenters also describe how the framework was developed through discussions at conferences and within a Research Data Alliance community of interest.
1. Agile Data Curation:
A Conceptual Framework and
Approach for Practitioner
Data Management
Presenting Author: Josh Young1
Co-Authors: Karl Benedict2 and Christopher Lenhardt3
1. University Corporation for Atmospheric Research (UCAR) Unidata Program Center, Boulder, USA
2. University of New Mexico, Albuquerque, USA
3. Renaissance Computing Institute (RENCI), University of North Carolina at Chapel Hill, Chapel Hill, USA
2. Scope
Imagine a project:
• that includes a well-thought-out and documented
data management plan,
• and robust implementation of that plan throughout
the project and beyond.
• This talk is not for that project; it is for the rest of
us.
3. So why do we care about data
management?
• Internal reasons: do good research, write
papers, get tenure, win more grants.
• External reasons: public access &
reproducibility
Risk of becoming dark data (Heidorn, 2008)
4. Why care about external access?
• Intangibles for an Investigator
• Maybe someday I’ll benefit from someone else’s data
• Maybe I’ll learn something through informal dialogue
• Most science funding is from public resources and should/could be
considered a public trust resource
• Peer pressure
• Tangibles for an Investigator
• Increased efficiency
• My funders require it.
5. So why do we care about data
management?
• Internal reasons: do good research, write
papers, get tenure, win more grants.
• External reasons: greater impact
Agile
Curation
8. Agile Curation:
• Means taking implementable steps to
improve data management for external
access.
• Philosophically, it attempts to apply
lessons from agile software development
to data management.
9. Agile Curation Principles,
2nd Generation
1) Delivery, access, use and citation of research
data are the primary measures of success.
2) Maximize the impact of research data through the
continuous integration of curation activities.
3) Support unanticipated needs for and uses of
research data (and documentation) and develop
flexible systems to capture new uses.
10. Agile Curation Principles,
2nd Generation
4) Make data open and accessible as early in the process as
possible.
5) Encourage crowd-sourced / community feedback to improve
and enhance the data. Provide basic metadata for data
available early in the process even if the data are not
finalized.
6) Identify key individuals in a research project who have the
requisite motivation, knowledge, or ability to learn, and get
out of their way.
11. Agile Curation Principles,
2nd Generation continued
7) Data creators and data curators should work closely
throughout the data life story to ensure the most efficient and
streamlined process.
8) Identify the most effective method(s) for maintaining close
communication between the data creators and curators
involved and use them.
9) Target the steady delivery of incremental improvements to
research data discovery, access and use that is consistent
with a sustainable level of effort and available funding.
12. Agile Curation Principles,
2nd Generation continued
10) Start with the basics and only make systems more
complex as needed, while maintaining a low bar to
entry.
11) Continuous attention to technical excellence and
good design enhances agility.
12) Continuously develop a community of data providers,
curators and users that participate in the evolution of
the research data systems.
13. What happens next?
• Case Studies documentation:
To clarify and/or verify these principles
To provide workflow examples that can
be adopted or revised for reuse
• Nascent community of interest within the
Research Data Alliance
14. Scope
Imagine a project:
• that includes a well-thought-out data management
plan,
• and robust implementation of that plan throughout
the project.
• This talk is not for that project; it is for the rest of
us.
15. Unidata is one of the University Corporation for
Atmospheric Research (UCAR)'s Community
Programs (UCP), and is funded primarily by
the National Science Foundation
(Grant NSF-1344155).
18. Agile Curation Principles,
1st Generation
1) Access to data is the first goal;
2) Generative value is supported (Zittrain, 2006);
3) Researcher involvement through a participatory framework that
aligns data management with scientific research processes
(Yarmey and Baker, 2013);
4) Projects will utilize free open-source resources to the greatest
extent practical;
5) Community participation increases project capacity;
19. Agile Curation Principles,
1st Generation part 2
6) Data management requirements and practices evolve as the
research project proceeds;
7) Bright and dedicated individuals can learn appropriate skills and
respond to the demands of their particular project, as they
proceed;
8) Approaches apply across scales;
9) Consider technical debt;
10) Data evaluation can be conducted through use and feedback.
20. How we got here
• Idea formulated during discussion of Data
Management Lifecycles at GeoData 2014
• Principles drafted for AGU 2014
• Two Research Data Alliance (RDA) Birds of a
Feather sessions to explore community
experiences
Editor's Notes
This work is a joint effort of all authors.
This talk and effort are inspired by the desire to move projects currently at risk of becoming dark data to at least become long-tail data. However, the concepts described may be useful to projects currently in the long tail or even at the big-head end of the spectrum.
We need to recognize that there are at least two motivations for data management: internal reasons and external reasons. As researchers, we focus on our internal research needs, but from a societal perspective the potentially greater value comes from external access.
Agile curation is not focused on assisting you with the workflow for your internal goals (though there may be benefits there too). Instead, the focus is on helping researchers meet external data management challenges.
Internal workflows tend to be optimized, at least according to the preferences of the individual researcher.
From the perspective of most researchers, public or external access is at best a secondary purpose, and the workflows for it are not optimized in the same way. These photos are analogous examples. A sign may be put out notifying the public that something is freely available, but its quality claim may be questionable (the sign says "good free stuff", but it is for upholstered furniture left in the snow). The sign may offer no quality descriptor at all. Or there may be no sign announcing free access, with the offer relying instead on awareness of social conventions. Does this sound like our current public access approach?