EPrints Preservation: Why we need Preservation Planning


This short presentation is intended as a light interlude linking two practical sessions in a workshop connecting preservation planning with tools provided for use with EPrints repository software.

  • Digital preservation is an important topic, which can be perceived as technical and scary, and although it appears to attract interest and concern in equal measure, it is practiced somewhat less outside the specialist national and commercial institutions. This is because it often begins from a position of little focus, particularly in terms of realistic timescales, and there are problems when it comes to allocating resources in terms of cost, time and effort. To help understand why this might be, and to gauge at what point digital content and repository managers might expect a natural transition from interest/concern to practice, we produced this rough rule-of-thumb metric. If one or more of these criteria apply, then the application of digital preservation is likely to become magically less onerous and more beneficial for your content.
  • This is another way of saying the same thing as the previous slide about digital preservation, but it’s important to note the converse point. We tend to think of what digital preservation can do for the content already collected, but it matters just as much to anticipate what additional content you will collect over the specified timescale, and how this will affect your content profile. If your content profile is likely to change, then your preservation measures are likely to have to change as well.
  • What do we mean by content profile? In the KeepIt project we have four exemplar repositories. Miggie Pickton’s presentation in the main Open Repositories 2010 conference introduced these repositories and their progress with preservation - see this blogged report of the presentation (http://blogs.ecs.soton.ac.uk/keepit/2010/07/14/exemplars-reveal-seven-steps-to-preservation-readiness/). The exemplars each provide different types of content: research, science data, teaching, arts. At a simple level, that could be the content profile of an institutional repository in the future. You may be able to do a similar analysis for your repository. Going back a few years we envisaged the emergence of preservation services and service providers for digital repositories. Essentially what we have now is a range of preservation tools rather than service provider organizations. So once you have done the analysis of the previous three slides, preservation resolves to the application of selected tools – so preservation is no more or less than other repository applications, whether CRIS, REF or other tools you use with your repository.
  • In the KeepIt project we ran a practical course to introduce repository managers, from our exemplars and others, to selected preservation tools. The materials are all online, in Slideshare for a quick overview of the presentations, or in our repository for all the original source materials and practical exercises. And there are blogs for context, comment and subsequent practice (http://blogs.ecs.soton.ac.uk/keepit/tag/keepit-course/). Broadly, the training course covered: Understanding your institution: what the institution and its people can do for your repository, e.g. in terms of providing content; and the context for what you can do for the institution, e.g. in terms of policy framework. Then you need to establish the resources available to you, in terms of the budgets you can acquire and the costs you have to cover. Finally – not least, but I mean finally – you will want to demonstrate trust, that is your way of showing that you have taken account of all these issues, and the risks all this might pose, and are serving the requirements of your institution. There are tools for all these things. The bit in the middle – modules 3 and 4 – is what we are considering today: what we might call technical preservation, understanding and managing the digital bits.
  • At the heart of technical preservation are file formats. Authors and content creators don’t choose formats, they choose applications based on functionality and what it allows them to create. Machines don’t see functionality, they see bits. Instead they try to represent functionality according to the whole computing environment that is being used at a given moment. Since you can’t control what tools authors use and what systems may be used to access content, you have to get your hands dirty and understand how the machine sees things, in particular how it sees file formats.
  • Here is our preservation workflow showing three elements introduced in the preceding practical session of the workshop. We know we can classify the formats of digital objects using tools like DROID, and we’ve suggested that some formats might be high risk - without saying much about which ones and why – and that in such cases we have tools, for example, to migrate such formats to another format. The hard part is the bit in the middle that connects the identification with the action, i.e. what to act on, whether to act and, if so, when.
  • First, let’s try and relate this to your repositories, especially those that focus on open access collections of research papers. Here are the results from a survey posted on JISC-REPOSITORIES earlier this year. We can see the familiar emphasis on the PDF format.
  • There may be reasons for choosing PDF, and most will probably resolve to the debate about open source and open standard. A few years ago this was a simple case, notably against Microsoft formats. Two things happened: nothing much changed in the high usage share of MS Office applications despite competition from free and open standard tools such as Open Office, and MS standardized and opened its format specifications. The case is not so simple now.
  • There are more factors than ‘open’ to consider when assessing the risks associated with different file formats, as we can see from this list produced by the (UK) National Archives.
  • During our KeepIt course module 3 we set up a short group exercise to select a pair of formats and think about applying the format risk factors identified on the previous slide.
  • We had three groups, and here are the results for the formats they each chose to compare. We removed two factors that were slightly more technical and might have required some familiarity with documentation. Groups were asked to think about the remaining factors based on no more information than you have here. It’s meant to be that intuitive. How surprising are the results? Perhaps the first result is no surprise given published repository preferences, but the quick reverse for PDF against XML should give pause for thought. The image formats are relevant to our preservation planning exercises at this workshop; the TIFF vs JPEG result might be surprising in the context of information that follows in a slide or two. The table reproduced here comes from this report on KeepIt course module 3: http://blogs.ecs.soton.ac.uk/keepit/2010/03/31/digital-preservation-tools-for-repository-managers-primer-on-preservation-workflow-formats-and-characterisation/
  • In addition, we asked groups to provide a reason why the result of their format considerations might not stand up in all situations. You see, these are not definitive results.
  • Before we get back to the main workshop let’s consider the image format comparison. In the preceding exercise you imported some images in the GIF format, and next in the workshop we will consider possible migration to other formats. Here we consider TIFF vs JPEG. Remember the group result favoured TIFF over JPEG. Now let’s look at what some large archival organizations are doing with image formats. Basing the scores on the factors given, were the group right or wrong?
  • The group result was not necessarily wrong. Here’s another view performed by an expert archival group using Plato. It’s about using tools to provide information and expertise, but ultimately it’s your - the content manager’s - judgment and your decision. Try the exercise. It’s not as daunting as it seems. That’s where Plato comes in. We aren’t going to learn how to use Plato in this workshop, merely how to import a preservation plan from Plato, which is designed to act on selected formats, in this case on the GIF image format. Essentially Plato allows you to apply the expert information but control the parameters which lead to the final outcome – that’s a decision on what to do with formats that are identified as at-risk, and taking the consequent action. At this point, having considered the role of file formats and the essentials of preservation workflow and preservation planning, we are ready to rejoin the practical workshop.
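The format-risk exercise described in these notes (one point to whichever format wins each risk factor, then total the points) is simple enough to sketch in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, "Format A" and "Format B" are placeholders, and the per-factor judgements are invented for the example. They are not the course groups' actual answers, which the notes report only as totals.

```python
from collections import Counter


def score_formats(judgements):
    """Tally one point per risk factor won; a draw (None) scores nothing."""
    totals = Counter()
    for factor, winner in judgements.items():
        if winner is not None:
            totals[winner] += 1
    return totals


def overall_winner(totals):
    """Return the format with the most points, or None for a tie or no points."""
    ranked = totals.most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Invented judgements for two placeholder formats (illustrative only).
# The factor names follow the PRONOM risk list shown in the slides.
judgements = {
    "Ubiquity": "Format A",
    "Support": "Format A",
    "Stability": None,  # judged a draw
    "Ease of identification": "Format B",
    "Lossiness": "Format A",
    "Complexity": None,  # judged a draw
}

totals = score_formats(judgements)
print(dict(totals), overall_winner(totals))
```

As the notes stress, a total like this is a prompt for discussion rather than a definitive result. The last step of the exercise, giving a reason why the winner might not be the right choice for your repository, is deliberately left to human judgement.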
    1. 1. EPrints Preservation: Why we need Preservation Planning, by Steve Hitchcock. EPrints User Group, OR10, Madrid, 9 July 2010
    2. 2. A first take on digital preservation
        DIGITAL PRESERVATION is NOT so DIFFICULT if you WANT to DO IT
        You will want to do digital preservation if you have
        • a lot of digital content
        • collected over years
        • a specified responsibility and resources for that content
        • an understanding of how that content is used now
        • how it will be needed in future,
        • how the type of content you collect may change going forward
    3. 3. Another take on digital preservation
        Digital preservation is identifying what’s required of your repository content tomorrow, and being ready to serve that requirement - or the day after, or the month after. In other words, you can extend this to whatever timescale matters. You can work out everything else from these parameters.
        Conversely: what will your content profile look like over the timescale that matters to you? Will it change substantially?
    4. 4. Digital repositories diversifying: institution-wide outputs
        KeepIt exemplar preservation repositories
        • Research
        • Arts
        • Science
        • Teaching
    5. 5. Slideshare: http://www.slideshare.net/SteveHitchcock/presentations
        Source materials (ECS EPrints): http://bit.ly/afof8g
        • Module 1, Organisational issues, audit, selection and appraisal. School of ECS, University of Southampton, 19 January 2010
        • Module 2, institutional and lifecycle preservation costs. School of ECS, University of Southampton, 5 February 2010
        • Module 3, Primer on preservation workflow, formats and characterisation. Westminster-Kingsway College, London, 2 March 2010
        • Module 4, Putting storage, format management and preservation planning in the repository. University of Southampton, 18-19 March 2010
        • Module 5, Trust, of the repository, of the tools and services it chooses. University of Northampton, 30 March 2010
    6. 6. Work with, not against, your authors and contributors
        “Preservation begins with the author”
        U. Rochester (USA) has written its own repository software IR+ to give its authors a Web-based authoring workspace, but watch out for new JISC project DepositMO, connecting the user's computer desktop, especially popular apps such as MS Office, with digital repositories.
        Which applications are widely used and popular among your authors? Digital content authoring tools are typically chosen on the basis of purpose, utility, familiarity (what is provided, supported by Information Systems?). Rarely are they chosen for format or preservation.
        Authors will craft their output in the chosen application, but will often throw away that craft if asked to convert to another format.
    7. 7. Preservation workflow
        Analyse / Check / Action
        • Format identification, versioning
        • File validation
        • Virus check
        • Bit checking and checksum calculation
        Tools: e.g. DROID, JHOVE, FITS
        Preservation planning
        Characterisation: significant properties and technical characteristics, provenance, format, risk factors
        Risk analysis
        Tools: Plato (Planets), PRONOM (TNA), P2 risk registry (KeepIt), INFORM (U Illinois), KB
        • Migration
        • Emulation
        • Storage selection
    13. 13. Accepted repository formats: recent survey
        What file formats do you accept? Do you convert any to a different format?
        • ALL: Accept any format.
        • Two: Convert everything to PDF, but store the source files in the background for preservation reasons.
        • Four: Mention specifically converting Word to PDF: one seeks permission from the author to do this, and uploads as Word if permission is not granted.
        • One: Mentions converting ZIP files to PDF.
        Sue Ashby, University of Portsmouth Library, Summary of responses to IR questionnaire, JISC-REPOSITORIES, 18 February 2010
    14. 14. Some thoughts about formats
        Free vs open source vs open standard:
        • MS Office – XML – open standard
        • Open Office – free – XML – open standard
        • PDF page representation
        • XML generic Web format, computational
    15. 15. Format risks
        1000 Ubiquity: degree of adoption of the format
        1001 Support: number of tools available which can access the format
        1002 Disclosure: extent to which the format documentation is publicly disclosed
        1003 Document Quality: completeness of the available documentation
        1004 Stability: speed and backwards-compatibility of version change
        1005 Ease of Identification: ease with which the format can be identified
        1006 Ease of validation: ease with which the format can be validated
        1007 Lossiness: does the format use lossy compression
        1008 Intellectual Property Rights: whether or not the format is encumbered by IPR
        1009 Complexity: degree of content or behavioural complexity supported
        From PRONOM documentation (The National Archives), July 2008
    18. 18. A group task on format risks
        • Choose two formats to compare (e.g. Word vs PDF, Word vs ODF, PDF vs XML, TIFF vs JPEG)
        • By working through the (surviving) list of format risks select a winner (or a draw) between your chosen formats for each risk category (1 point for win)
        • Total the scores to find an overall winning format
        • Suggest one reason why the winning format using this method may not be the one you would choose for your repository
    19. 19. Format risk results (from group thinking)
        • PDF 4, Word 1
        • TIFF 3, JPEG 1
        • XML 6, PDF 1
    20. 20. Alternative thoughts on ‘winning’ formats
        We were then asked to consider why we might choose NOT to use the format that performed better for these criteria:
        • PDF/Word – Why not PDF? PDF is essentially a conversion format, not a source authoring format.
        • TIFF/JPEG – Why not TIFF? JPEG is compressed, would take up less space in storage. This factor may be crucial. Archival quality copy or a derivative?
        • XML/PDF – Why not XML? Many repository resources are deposited in PDF. Do people understand what they need to do with XML?
    21. 21. TIFF vs JPEG 2000?
        Who’s for JPEG? The major players line up
        • The National Library of the Netherlands evaluated JPEG 2000 against uncompressed TIFF (currently used) for storage capacity, image quality, long-term sustainability, functionality. JPEG 2000 is recommended as future archive format.
        • The British Library recently moved forward to migrate their 80-terabyte newspaper collection from TIFF to JPEG 2000
        • The Wellcome Library announced they will use JPEG 2000 for their upcoming digitization projects
        Preservation Planning at the Bavarian State Library Using a Collection of Digitized 16th Century Printings, D-Lib Magazine, Vol. 15 No. 11/12, Nov/Dec 2009, http://www.dlib.org/dlib/november09/kulovits/11kulovits.html
    22. 22. TIFF vs JPEG 2000?
        What does Plato say?
        “At this point in time not migrating the TIFF v6 images is the best alternative.”
        “However, in one year we'll look at this plan again to see if there are more tools available and whether or not the ones we considered in this year's evaluation have been improved.”
        Preservation Planning at the Bavarian State Library Using a Collection of Digitized 16th Century Printings, D-Lib Magazine, Vol. 15 No. 11/12, Nov/Dec 2009, http://www.dlib.org/dlib/november09/kulovits/11kulovits.html
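The "Bit checking and checksum calculation" item on the preservation workflow slide can be illustrated with a minimal Python sketch. It assumes SHA-256 as the checksum algorithm and uses hypothetical helper names (`sha256_of`, `is_unchanged`). The slides do not say which algorithm or tool the workshop used, so this shows the general idea rather than the EPrints implementation.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=65536):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_checksum):
    """Compare the file's current checksum with one recorded earlier."""
    return sha256_of(path) == recorded_checksum


# Demonstration on a throwaway temporary file (the file name is arbitrary).
with tempfile.TemporaryDirectory() as workdir:
    sample = os.path.join(workdir, "deposit.bin")
    with open(sample, "wb") as handle:
        handle.write(b"original deposited content")

    recorded = sha256_of(sample)  # checksum stored at deposit time
    intact_before = is_unchanged(sample, recorded)

    with open(sample, "ab") as handle:  # simulate a change to the stored bits
        handle.write(b"!")
    intact_after = is_unchanged(sample, recorded)

print(intact_before, intact_after)
```

Recording a checksum at deposit and recomputing it later shows whether the stored bits have changed. It says nothing about whether the format itself is at risk, which is the separate question the preservation planning step addresses.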