Introduction to digital curation


Slides for digital curation talk given at the DEFRA DTC Knowledge Exchange Workshop on August 9th, 2011

Published in: Technology, Education
Introduction to digital curation

  1. Introduction to Digital Curation
     DTC Archive Knowledge Elicitation Workshop, Windermere, August 9th, 2011
     Gareth Knight, Digital Curation Specialist, Centre for e-Research
  2. Session Overview
     1. What is digital curation and preservation?
     2. Why should I curate?
        • Reasons to curate your data
        • Threats to protect against
     3. How do you perform digital curation?
        • Organisational requirements
        • Practical steps
  3. Curation and the digital lifecycle
     Digital Curation: “The activity of managing and promoting the use of data from its point of creation, to ensure it is fit for contemporary purpose, and available for discovery and reuse.” (Philip Lord et al., 2004)
     Digital Preservation: “…refers to the series of managed activities necessary to ensure continued access to digital materials for as long as necessary.” (Beagrie & Jones, Preservation Management of Digital Materials: A Handbook, 2001, p. 10)
  4. Why curate content?
     Researcher perspective
     1. Protect value of content
     2. Maximise visibility and impact of researcher
     3. Enable continued development and use
     Institutional perspective
     4. Protect financial investment
     5. Evidence of operation and impact
     6. Compliance with appropriate regulations
  5. Digital lifecycle threats (1)
     (1) Inability to locate: Data files have been lost or corrupted and an alternative copy cannot be found.
         • “What data was created by Project/Person X in 2005?”
         • “I wrote a contract and stored it on a shared drive in 2008. Where is it located now?”
         • “Person X wrote a useful software application before they left the institution. What happened to it? Has it been deleted or archived?”
     (2) Inability to access: Data files cannot be accessed due to media and/or software issues.
         • Floppy/tape reader not available to read media
         • Format obsolescence: access software becomes obsolete through gradual replacement (e.g. MS Word 5, 95, XP), the developer closing, or OS incompatibility
  6. Digital lifecycle threats (2)
     (3) Uncertainty over content: Many data files exist, but it is unclear what constitutes the final product of the research process and what is a by-product of investigation.
         • Lots of different formats: which is the original and which is derivative?
         • What is the organisational structure?
     (4) Inability to understand: Data files can be accessed using appropriate software, but the context of the research content cannot be established.
         • When, how and by whom was the data created or recorded? Does this affect the interpretation?
         • What is the intended meaning of a database/spreadsheet column or data value? Will we need a Rosetta Stone to understand the content?
  7. Digital lifecycle threats (3)
     (5) Questionable authenticity: Content interpretation changes between software. How do you know that the content represents the author's intent?
         • Decoding differences between product versions, e.g. MS Word 97 and 2010
         • Alternative products, e.g. OpenOffice vs. MS Word
         • Change may be introduced by format conversion tools
     (6) Uncertainty regarding usage rights: Rights issues associated with publication and use of content are unclear. Should the institution err on the side of caution?
         • Do you have the right to store the data?
         • Do you have the right to preserve it?
         • Do you have the right to publish it over time? Does this expire?
     [Slide images: MS Word 2, MS Word 2010]
  8. Institutional requirements to curate data
     • Organisational infrastructure: Must maintain an organisational structure capable of accepting, maintaining and making available information in a trustworthy manner.
     • Technical infrastructure: Must possess an appropriate strategy and procedures to maintain digital objects at a technical level.
     • Legal: Must possess IPR or appropriate permission to curate and preserve information.
     • Money: Requires £££ to pay for staff, electricity, space, etc. for long-term curation. This is difficult because the real costs of long-term digital preservation are not clear.
  9. Data Storage
     Reality: all digital storage media are unreliable:
     • Gradual degradation over time
     • Unexpected failure through power surge, unexpected motion, and theft
     • 3rd party storage providers can close their service and delete your content
     Practical approaches to take:
     • Appraise: do you need to keep everything?
     • Store content on at least 2 forms of storage in different locations, e.g. one copy on an office shared drive and a remote copy (offsite backup)
     • Perform regular fixity checks to test your backups
     • Copy data files to new media every 2-5 years after first creation
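The fixity checks mentioned on the Data Storage slide can be sketched roughly as follows. This is a minimal illustration, not a method from the talk: it assumes SHA-256 checksums are an acceptable fixity measure, and the function names (`sha256_of`, `build_manifest`, `check_fixity`) are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def check_fixity(manifest: dict, copy_root: Path) -> dict:
    """Compare a stored manifest with a copy; report missing and changed files."""
    current = build_manifest(copy_root)
    missing = sorted(name for name in manifest if name not in current)
    changed = sorted(
        name for name in manifest
        if name in current and current[name] != manifest[name]
    )
    return {"missing": missing, "changed": changed}
```

One way to use it: build the manifest once from the master copy, keep it alongside the data, and run `check_fixity` against each backup location at regular intervals. A non-empty `missing` or `changed` list suggests the copy should be restored from another location.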
  10. Data Organisation
      If someone examined your data for the first time, what would they wish to know?
      • What research collection is contained within the directory?
      • What type of information does it contain?
      • Where can I find specific content, e.g. final report, analysis data?
      Practical approaches to take:
      • Establish a directory structure that clearly distinguishes between groups of files (e.g. reports, photographs, etc.). Use sub-directories for sub-categories (e.g. topics, date, version).
      • Adopt a consistent approach to organising directories (across your department, if possible).
      • Label files in a manner that allows purpose, version and other relevant information to be quickly identified (e.g. using filename, cover page).
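As a rough sketch of the directory and file-labelling advice on the Data Organisation slide, the snippet below creates one sub-directory per group of files and builds filenames that carry project, purpose, date and version. The group names and the naming pattern are assumptions chosen for illustration; the slides do not prescribe a particular scheme.

```python
import re
from datetime import date
from pathlib import Path

# Hypothetical top-level groups; adapt to your own collection.
GROUPS = ("reports", "photographs", "analysis_data", "documentation")


def create_structure(project_root: Path, groups=GROUPS) -> list:
    """Create one sub-directory per group of files and return the paths."""
    created = []
    for group in groups:
        target = project_root / group
        target.mkdir(parents=True, exist_ok=True)
        created.append(target)
    return created


def slug(text: str) -> str:
    """Lower-case the text and replace runs of other characters with '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_filename(project: str, description: str, version: int,
                   created: date, extension: str) -> str:
    """Build a name of the form project_description_date_vNN.ext."""
    ext = extension.lstrip(".").lower()
    return (
        f"{slug(project)}_{slug(description)}_"
        f"{created.isoformat()}_v{version:02d}.{ext}"
    )
```

For example, `build_filename("ProjX", "Final Report", 2, date(2011, 8, 9), ".PDF")` yields `projx_final-report_2011-08-09_v02.pdf`. Zero-padded versions and ISO dates keep files in a sensible order when sorted by name.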
  11. Choosing appropriate formats
      How do you choose the correct file format to store your content in the short and long term?
      • Each format has different capabilities and no format suits every task, e.g. MS Word is not suitable for web access.
      • Some formats remove content or functionality to reduce file size or limit use, e.g. JPEGs lose detail and PDFs are difficult to edit.
      Practical approaches to take: select different formats based upon needs, rather than a single format:
      • Digital master: preservation copy intended for long-term storage
      • Dissemination: access formats for use by specific users, e.g. PDF
      Format of the digital master:
      • Try to use common, widely used formats supported by a range of software tools.
      • Store content in formats that support required attributes (e.g. 16 million colours) and will not degrade when resaved. Re-examine your file after you have saved it.
      • Retain all data associated with the original creation/capture process, as it may contain information properties that are useful at a later date.
  12. Documentation
      Description: information necessary to interpret, understand and use a given dataset or set of documents.
      What would someone wish to know about your content?
      • Who created it?
      • When was it created?
      • Why was it created? Who funded it?
      • What is the source of the material used?
      • What is the motivation for the approach you took?
      • What content can be published?
      • How can it be used?
      Practical approaches to take:
      • Attach a cover page to your document with relevant creator and rights information.
      • Create a catalogue record for your digital repository.
      • Create an administrative file for internal use that can help colleagues and repository staff, and assign it an appropriate filename.
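The administrative file suggested on the Documentation slide could, for instance, be a small structured record that answers the questions listed there. The sketch below is one possible shape, not a standard: the `DatasetRecord` fields, the JSON format and the default filename `README_metadata.json` are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class DatasetRecord:
    """Answers to the documentation questions (hypothetical field names)."""
    title: str
    creator: str            # Who created it?
    created: str            # When was it created?
    purpose: str            # Why was it created?
    funder: str             # Who funded it?
    sources: list = field(default_factory=list)  # Source of material used
    method_rationale: str = ""     # Motivation for the approach taken
    publishable_content: str = ""  # What content can be published?
    usage_terms: str = ""          # How can it be used?


def missing_fields(record: DatasetRecord) -> list:
    """List the fields that are still empty, in declaration order."""
    return [name for name, value in asdict(record).items() if not value]


def write_record(record: DatasetRecord, directory: Path,
                 filename: str = "README_metadata.json") -> Path:
    """Write the record as JSON next to the data and return its path."""
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / filename
    target.write_text(
        json.dumps(asdict(record), indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    return target
```

Calling `missing_fields` before depositing a dataset gives a quick list of questions that remain unanswered, which is usually easier to fix while the creator is still available.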
  13. Conclusion
      Digital curation is an institutional commitment to maintaining data:
      • Data has value, and protecting that value should not be left to the researcher alone.
      • Infrastructure is required to curate data in a trustworthy manner.
      Many strategies are available to curate and preserve:
      • Short- and long-term choices can help or hinder content access over time.
      • The challenge is to select those that enable continued access, while minimising the risk of data loss or corruption.
      Easy steps to protect the value of your data:
      • Store your data in 2 or more locations
      • Organise it using an easy-to-understand structure
      • Adopt a digital master format that is fit for purpose
      • Document information that cannot be obtained elsewhere
  14. Thank you for your attention. Questions?
      Gareth Knight