Data Selection & Triage
JISC/DCC Progress Workshop
Managing Research Data & Institutional Engagement
Nottingham, 25 October 2012
This work is licensed under a Creative Commons Attribution 2.5 UK: Scotland License
Introduction
How can researchers and support staff effectively decide what data is worth holding on to, agree what to do with it, and arrange for its handover?
What challenges does this represent?
How to address them?
Outline
• What guidelines are there and why do we need more? Angus Whyte, DCC, and Marie Therese Gramstadt, KAPTUR
• UK Data Archive's Data Review Process - Veerle van Eynden, UKDA
• Applying NERC's Data Value Checklist - Sam Pepler, British Atmospheric Data Centre
• Discussion
Guidelines clarify expectations
… adapted by: Archaeology Data Service, NERC, KAPTUR, University of Leicester
What criteria will be used to judge what’s handed over?
Basic model
1. Define a policy, i.e. criteria and range of decisions
2. Archive manager applies data criteria, involving researchers
3. Select the significant, dispose of the rest
[Diagram labels: All; 10%; 90%]
For records, yes, but research data?
Characterising research data…
• Research process is more uncertain and open-ended than admin processes
• Research data purpose may change before the research is complete
• More effort to make reusable: complex inter-relationships, and richer contexts to document
• Originators should be engaged but may not have capacity, e.g. if project funding has ceased
• Others may need to be involved, with a broader view of potential in other disciplines
• More than a keep/dispose choice: need to prioritise attention and effort to make data fit for reuse
Triage analogy
First characterise research data
• Criteria: duty of care; reuse value; quality and condition; accessibility; costs associated; potential to automate?
• Prioritise: high reuse value + needs attention, affordable; other permutations; more permutations; low reuse value, unaffordable
• Deposit location: Institutional Data Repository; Data Centre; Subject Repository etc.
• Tiered approach to deploying resources: discoverability; access management; storage performance; preservation actions
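The prioritisation step on the triage slide can be sketched in code. This is a minimal illustration only, not a method presented at the workshop: the `Dataset` fields, the 1 to 5 reuse-value scale, the threshold of 4 and the tier names are all hypothetical assumptions, chosen to mirror the slide's "high reuse value + needs attention, affordable", "low reuse value, unaffordable" and "other permutations" groups.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    reuse_value: int       # hypothetical scale: 1 (low) to 5 (high)
    needs_attention: bool  # effort is needed to make the data fit for reuse
    affordable: bool       # the associated costs can be met


def triage(ds: Dataset) -> str:
    """Assign one dataset to a priority tier (illustrative rules only)."""
    high_value = ds.reuse_value >= 4  # assumed threshold, not from the slides
    if high_value and ds.needs_attention and ds.affordable:
        return "prioritise"
    if not high_value and not ds.affordable:
        return "low priority"
    return "review"  # stands in for the slide's "other permutations"


# Hypothetical datasets, for illustration only
datasets = [
    Dataset("survey-2011", reuse_value=5, needs_attention=True, affordable=True),
    Dataset("scratch-runs", reuse_value=1, needs_attention=False, affordable=False),
    Dataset("field-notes", reuse_value=5, needs_attention=False, affordable=True),
]
tiers = {d.name: triage(d) for d in datasets}
print(tiers)
```

In practice the "review" tier is where most of the judgement lies; a real policy would break those permutations down further and involve the originating researchers.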
Clarify expectations
What kinds of “data” are wanted?
For what kinds of reuse?
e.g. Data Centre Collection Policies
“The ADS expects to collect all of the following archaeological data types…”
http://archaeologydataservice.ac.uk/advice/collectionsPolicy
Costs should persuade us
IDC Digital Universe Study: increasing volumes outpace declining storage hardware costs
According to: John Gantz and David Reinsel, 2011, Extracting Value from Chaos
http://www.emc.com/digital_universe.
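The arithmetic behind "volumes outpace declining costs" can be sketched as follows. The growth and price-decline rates below are hypothetical placeholders, not figures from the IDC study; the point is only that total spend rises whenever (1 + volume growth) × (1 − price decline) is greater than 1.

```python
def relative_cost(years: int, volume_growth: float, price_decline: float) -> float:
    """Total storage spend after `years`, relative to year 0.

    volume_growth: yearly fractional growth in data volume (0.5 means +50%)
    price_decline: yearly fractional fall in cost per unit stored (0.25 means -25%)
    """
    yearly_factor = (1 + volume_growth) * (1 - price_decline)
    return yearly_factor ** years


# Hypothetical rates, chosen for illustration only (not taken from the IDC study)
for year in range(6):
    print(year, round(relative_cost(year, volume_growth=0.5, price_decline=0.25), 2))
```

With these assumed rates the yearly factor is 1.125, so spend keeps growing even though each unit of storage gets cheaper every year.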
We can’t afford it all
“Keeping 2018’s data in S3 would cost the entire global GDP”
http://blog.dshr.org/2012/05/lets-just-keep-everything-forever-in.html
Selection presumes description
• You can’t value what you don’t know about!
• Researchers can’t afford NOT to spend effort on minimal metadata description and organisation, because costs of retention will be much higher if they don’t
• Description makes data affordable. Is citation potential a concrete enough reward?
Challenges
• Identify what datasets are created and where they are
• Differentiate those that are of high value from those with the most uncertainty or the least reusability
• Be able to justify ‘natural’ wastage of low priority data as much as deliberate selection of high value data
Questions
• What has worked / is working?
• What lessons have you learned, and how generalisable are they?
• What challenges remain?
• How may they be approached, and what do you intend to do?
• What DCC / MRD activity do you think may help make the challenge more tractable?