KeepIt Course 3: preservation workflow
 


This presentation introduces preservation workflow, a process to manage the risk associated with file formats of different digital objects. It was given as part of module 3 of a 5-module course on digital preservation tools for repository managers, presented by the JISC KeepIt project. For more on this and other presentations in this course look for the tag 'KeepIt course' in the project blog http://blogs.ecs.soton.ac.uk/keepit/

    Presentation Transcript

    • Introduction to preservation workflow, formats and risks. This section by Steve Hitchcock, KeepIt project. For JISC KeepIt course on Digital Preservation Tools for Repository Managers, Module 3: Primer on preservation workflow, formats and characterisation. Westminster-Kingsway College, London, 2 March 2010
    • Overview of session • Some terminology • Preservation workflow • Repository format profiles • Format risks: a group task • Some thoughts about formats
    • Representation information: connecting data with what we see
    • Open Archival Information System (OAIS) reference model
    • The 3-stage repository model Get Content Manage Content Serve Content (Ingest) Appraise & Select Store Retrieve Index Locate Ingest Preservation - Check Preservation - Analyse Preservation - Action Dispose
    • Preservation workflow
      Check: • Format identification, versioning • File validation • Virus check • Bit checking and checksum calculation. Tools, e.g. DROID, JHOVE, FITS
      Analyse: • Preservation planning • Characterisation: significant properties and technical characteristics, provenance, format, risk factors • Risk analysis. Tools, e.g. Plato (Planets), PRONOM (TNA), P2 risk registry (KeepIt), INFORM (U Illinois)
      Action: • Migration • Emulation • Storage selection
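The "bit checking and checksum calculation" step in the Check stage can be illustrated with a minimal Python sketch. It is not taken from the presentation or from any of the tools named on the slide; the function names `sha256_of` and `fixity_ok` and the choice of SHA-256 are assumptions made for illustration only.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks.

    Hypothetical helper: the algorithm and chunk size are illustrative choices.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path, recorded_checksum):
    """Compare the current checksum with the one recorded at ingest."""
    return sha256_of(path) == recorded_checksum
```

A repository would typically record the checksum once at ingest and re-run the comparison periodically; a mismatch signals that the stored bits have changed.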
    • Rosenthal on: doing less for preservation “I believe that "it became necessary to change the content in order to preserve it" is a very bad idea; we should preserve what's out there without adding cost and losing information by preemptively migrating to a format we believe (normally without evidence) is less doomed.” Are format specifications important for preservation? (January 4, 2009) http://blog.dshr.org/2009/01/are-format-specifications-important-for.html
    • Rosenthal on: aggressive vs relaxed preservation “In the long run, all digital formats become obsolete. Broadly, reactions to this dismal prospect have taken two forms: - The aggressive form has been to do as much work as possible as soon as possible - The relaxed form has been to postpone doing anything until it is absolutely essential” Format Obsolescence: the Prostate Cancer of Preservation (May 7, 2007) http://blog.dshr.org/2007/05/format-obsolescence-prostate-cancer-of.html
    • Rosenthal on: the Prostate Cancer of Preservation “format obsolescence is the prostate cancer of digital preservation. It is a serious and ultimately fatal problem. But it is highly likely that something else will kill you first “A risk-based approach would surely prefer the "relaxed" approach, minimizing up-front and storage costs, and thereby freeing up resources to preserve more, and higher-risk content. “The best example of a "relaxed" ingest pipeline is the Internet Archive, which has so far ingested over 85 billion web pages with minimal human intervention.” Format Obsolescence: the Prostate Cancer of Preservation (May 7, 2007) http://blog.dshr.org/2007/05/format-obsolescence-prostate-cancer-of.html
    • Repository format profile: an example Originally from Registry of Open Access Repositories (ROAR)
    • ROAR format profiles today This profile for Australian Research Online repository To access a format profile: Find chosen repository in ROAR, open [Record Details] Format profiles not available for all repositories in ROAR ROAR disclaimer: Full-text formats is based on automatic file-format identification and is prone to errors
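As a rough illustration of what a format profile is, the sketch below tallies a list of filenames by guessed MIME type. It is an assumption-laden toy, not ROAR's method: it guesses from the filename extension only, using Python's standard `mimetypes` module, whereas tools such as DROID inspect file contents. The function name `format_profile` is hypothetical.

```python
import mimetypes
from collections import Counter


def format_profile(filenames):
    """Count files per guessed MIME type (extension-based guess only)."""
    profile = Counter()
    for name in filenames:
        mime, _encoding = mimetypes.guess_type(name)
        # Files with no recognisable extension are grouped as "unknown".
        profile[mime or "unknown"] += 1
    return profile
```

Extension-based guessing makes it easy to see why automatic identification is error-prone: a renamed or extensionless file is misclassified or falls into the "unknown" group.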
    • Accepted repository formats: a recent survey What file formats do you accept? Do you convert any to a different format? • All accept any format. • Two convert everything to PDF, but store the source files in the background for preservation reasons. • Four mention specifically converting Word to PDF: one seeks permission from the author to do this, and uploads as Word if permission is not granted. • One mentions converting ZIP files to PDF. Sue Ashby University of Portsmouth Library Summary of responses to IR questionnaire JISC-REPOSITORIES, 18 February 2010
    • Format risks
      1000 Ubiquity: degree of adoption of the format
      1001 Support: number of tools available which can access the format
      1002 Disclosure: extent to which the format documentation is publicly disclosed
      1003 Document quality: completeness of the available documentation
      1004 Stability: speed and backwards-compatibility of version change
      1005 Ease of identification: ease with which the format can be identified
      1006 Ease of validation: ease with which the format can be validated
      1007 Lossiness: does the format use lossy compression
      1008 Intellectual property rights: whether or not the format is encumbered by IPR
      1009 Complexity: degree of content or behavioural complexity supported
      From PRONOM documentation (The National Archives), July 2008
    • A group task on format risks 1. Choose two formats to compare (e.g. Word vs PDF, Word vs ODF, PDF vs XML, TIFF vs JPEG) 2. By working through the (surviving) list of format risks select a winner (or a draw) between your chosen formats for each risk category (1 point for win) 3. Total the scores to find an overall winning format 4. Suggest one reason why the winning format using this method may not be the one you would choose for your repository
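The scoring method in this task (one point per risk category won, draws score nothing, totals decide the overall winner) can be sketched in Python as below. The function name `tally` and the placeholder format labels are hypothetical, and the judgements in the usage note are invented inputs, not an assessment of any real format.

```python
RISK_CATEGORIES = [
    "Ubiquity", "Support", "Disclosure", "Document quality", "Stability",
    "Ease of identification", "Ease of validation", "Lossiness",
    "Intellectual property rights", "Complexity",
]


def tally(judgements, format_a, format_b):
    """Total one point per category win; None marks a draw in that category.

    Returns (scores, overall), where overall is a format name or "draw".
    """
    scores = {format_a: 0, format_b: 0}
    for category, winner in judgements.items():
        if category not in RISK_CATEGORIES:
            raise ValueError(f"unknown risk category: {category}")
        if winner is None:
            continue  # a draw awards no point to either format
        if winner not in scores:
            raise ValueError(f"unknown format: {winner}")
        scores[winner] += 1
    if scores[format_a] == scores[format_b]:
        overall = "draw"
    else:
        overall = max(scores, key=scores.get)
    return scores, overall
```

For example, `tally({"Ubiquity": "Format A", "Lossiness": None, "Complexity": "Format B", "Support": "Format A"}, "Format A", "Format B")` gives Format A two points and Format B one. Because every category carries equal weight, the total can favour a format that loses on the one risk that matters most to a given repository, which is one way into step 4 of the task.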
    • Some thoughts about formats Free vs open source vs open standard: • MS Office – XML – open standard • Open Office – free – XML – open standard • PDF: page representation • XML: generic Web format, computational
    • Rosenthal on: why we can relax about preservation “Historically, the open source community has developed rendering software for almost all proprietary formats that achieve wide use “Even the formats which pose the greatest problems for preservation, those protected by DRM technology, typically have open source renderers” Format Obsolescence: Scenarios (April 29, 2007) http://blog.dshr.org/2007/04/format-obsolescence-scenarios.html
    • Work with, not against, your authors and contributors • “Preservation begins with the author” • U. Rochester (USA) has written its own repository software IR+ to give its authors a Web-based authoring workspace • But which applications are widely used and popular among your authors? Digital content authoring tools are typically chosen on the basis of purpose, utility, familiarity (what is provided, supported by Information Systems?) Rarely are they chosen for format or preservation. • Authors will craft their output in the chosen application, but will often throw away that craft if asked to convert to another format • One approach that builds on popular formats is ICE: Integrated Content Environment, which converts formats from popular content authoring tools
    • An image format comparison: TIFF vs JPEG 2000? Studies and user reports claim JPEG 2000 to be – or at least will become – the next archiving format for digital images. The format offers new possibilities, such as streaming, and reduces storage consumption through lossless and lossy compression. Another often claimed advantage of JPEG 2000 is that the master image can possibly serve as the access copy as well, and thus replace derived compressed, low resolution access copies. Preservation Planning at the Bavarian State Library Using a Collection of Digitized 16th Century Printings, D-Lib Magazine, Vol. 15, No. 11/12, Nov/Dec 2009 http://www.dlib.org/dlib/november09/kulovits/11kulovits.html
    • TIFF vs JPEG 2000? Who’s for JPEG? The major players line up 1. The National Library of the Netherlands evaluated JPEG 2000 against uncompressed TIFF (currently used) for storage capacity, image quality, long-term sustainability, functionality. JPEG 2000 is recommended as future archive format. 2. The British Library recently moved forward to migrate their 80-terabyte newspaper collection from TIFF to JPEG 2000. 3. The Wellcome Library announced they will use JPEG 2000 for their upcoming digitization projects. Preservation Planning at the Bavarian State Library Using a Collection of Digitized 16th Century Printings, D-Lib Magazine, Vol. 15, No. 11/12, Nov/Dec 2009 http://www.dlib.org/dlib/november09/kulovits/11kulovits.html
    • TIFF vs JPEG 2000? What does Plato say? “At this point in time not migrating the TIFF v6 images is the best alternative. “However, in one year we'll look at this plan again to see if there are more tools available and whether or not the ones we considered in this year's evaluation have been improved.” Preservation Planning at the Bavarian State Library Using a Collection of Digitized 16th Century Printings, D-Lib Magazine, Vol. 15, No. 11/12, Nov/Dec 2009 http://www.dlib.org/dlib/november09/kulovits/11kulovits.html
    • Further reading on formats and risk • Malcolm Todd, File Formats for Preservation, DPC Technology Watch Report Series, Report 09-02, 2 December 2009 http://www.dpconline.org/newsroom/file-formats-for-preservation-technology-watch-report.html • Judith Rog and Caroline van Wijk, Evaluating File Formats for Long-term Preservation, Koninklijke Bibliotheek, February 2008 http://www.kb.nl/hrd/dd/dd_links_en_publicaties/publicaties/KB_file_format_evaluation_method_27022008.pdf • See also Preserv project bibliography for many more papers on file formats http://preserv.eprints.org/Preserv-bibliography.html