Digital Preservation – An Introduction DPE/Planets/nestor training event, October 1st–5th, 2007, Vilnius, Lithuania Stefan Strathmann Göttingen State and University Library nestor - Network of Expertise in Long-Term Storage of Digital Resources
Session Outline 10:00 – 10:45 Lecture 10:45 – 11:00 Discussion  11:00 – 11:30 Coffee Break 11:30 – 12:30 Group Work 12:30 – 13:00 Groups present their results  13:00 – 13:15 Summary discussion
Key Questions What is digital preservation? Why is digital preservation important? What are the big challenges? What are the relevant standards, initiatives, programs?
Table of Contents General Introduction Relevant Aspects
Digital Preservation – The Challenge Hardware and Software are becoming obsolete in  very short periods of time Incompatibility of different versions of hard- and software  Fading knowledge of how to use older hard- and software Aging and decaying storage media  Loss of Information
Example – Loss of Information Acrobat 5 Acrobat 7
UNESCO Charter on the Preservation of Digital Heritage, October 15th, 2003 Article 1: “The digital heritage consists of unique resources of human knowledge and expression.” “Many of these resources have lasting value and significance, and therefore constitute a heritage that should be protected and preserved for current and future generations.”
UNESCO Charter: Articles Article 2 – Access to the digital heritage  Article 3 – The threat of loss  Article 4 – Need for action  Article 5 – Digital continuity  Article 6 – Developing strategies and policies  Article 7 – Selecting what should be kept  Article 8 – Protecting the digital heritage  Article 9 – Preserving cultural heritage Article 10 – Roles and responsibilities  Article 11 – Partnerships and cooperation
Digital Resources New forms of information: digital production (digitization, born digital, only digital) digital publication (only digital, object features like retrieval) digital distribution (portal, value chain) Rapid change of technology
Digital Long-Term Preservation Digital Preservation consists of processes that ensure that digital objects remain   accessible,  (re-)usable and  understandable in the future. Digital Preservation has to ensure that future software and hardware tools retain the authenticity, integrity, and reliability of the digital object.
Digital Preservation – A Definition What is meant by “digital long-term preservation” or “digital preservation”? Definition by Ute Schwens / Hans Liegmann (DNB/nestor): “In terms of preserving digital resources, ‘long-term’ does not mean issuing a guarantee for five or fifty years, rather the responsible development of strategies which can cope with the constant changes brought about by the information market.”
Preservation Approaches Migration  Emulation Normalisation Refreshing  Digital Archaeology Hardware Museum/Technology Preservation  Print to Paper or Microfilm/fiche or barcode ...
Digital Information – An Estimate UC Berkeley‘s School of Information Management and Systems: How much Information? 2003 Analysis of the year 2002 to estimate the yearly increase of new (digital and analog) information. Finding: 30 % increase of digital information per year See: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/index.htm
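To give a feel for what a 30 % yearly increase implies, here is a minimal sketch of the compound-growth arithmetic. It assumes the rate stays constant from year to year, which the study does not claim; the function names are illustrative only.

```python
import math


def growth_factor(annual_growth: float, years: float) -> float:
    """Factor by which a quantity grows after `years` at a constant yearly rate."""
    return (1.0 + annual_growth) ** years


def doubling_time(annual_growth: float) -> float:
    """Years needed to double at a constant yearly growth rate."""
    return math.log(2.0) / math.log(1.0 + annual_growth)


if __name__ == "__main__":
    rate = 0.30  # the 30 % figure quoted on the slide
    print(f"doubling time: {doubling_time(rate):.2f} years")
    print(f"growth over 10 years: x{growth_factor(rate, 10):.1f}")
```

Under that assumption the volume doubles in well under three years, which is why storage planning cannot be treated as a one-off decision.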
Heterogeneity - Materials Journals and monographs retrodigitized material genuine digital material Web Documents, Web Server Preprint-Server, theses, e-Proceedings, etc. Primary data, research data, raw data Emails, blogs, etc. Film, Music, Multimedia etc. ...
Heterogeneity: Formats Depends on subject, e.g. Mathematics (TEX, PS, ...) Geography (GIS) ... Multimedia, e.g. Animated WWW pages Interactive objects in e-Learning ... Different versions in e.g. PDF, TEX, ... Presentation Format / Preservation Format
Heterogeneity - General Metadata formats  (Dublin Core, MODS, PREMIS, MIX, ..) Exchange formats (XML, METS, XML/RDF, SOAP, ...) Controlled vocabulary systems (Ontologies, Taxonomies, ...) Architecture, Protocols ... Standardisation & Interoperability
Dealing with the Heterogeneity Preservation policy Cooperation: international/national Cooperation: cross-domain (e.g. museums, archives, research institutes, commercial organisations, ...) Redundancy of digital repositories explicitly desired Cooperative management/administration of distributed digital archives/repositories
... Coordinated cooperation needed between: producers of digital objects (e.g. scientists) providers (e.g. libraries) distributors (e.g. publishers, hosts of db) Use of international standards (e.g. DC, OAI, OAIS, METS)
OAIS Model – Example for a Standard [Diagram: the producer delivers a Submission Information Package (SIP) to Ingest; Archival Storage and Data Management hold the Archival Information Package (AIP); Access delivers a Dissemination Information Package (DIP) to the consumer; Preservation Planning and Administration support these functions.]
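The three OAIS information packages can be pictured as a small data flow: a producer submits a SIP, ingest turns it into an AIP, and access derives a DIP for the consumer. The sketch below is only an illustration of that flow; the class fields, the `ingest` and `access` functions, and the use of a SHA-256 checksum are assumptions made for the example, not part of the OAIS standard.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SIP:
    """Submission Information Package: what the producer hands over."""
    producer: str
    content: bytes
    descriptive_metadata: dict


@dataclass(frozen=True)
class AIP:
    """Archival Information Package: what the archive stores."""
    identifier: str
    content: bytes
    descriptive_metadata: dict
    sha256: str  # hypothetical fixity field added at ingest


@dataclass(frozen=True)
class DIP:
    """Dissemination Information Package: what the consumer receives."""
    identifier: str
    content: bytes
    title: str


def ingest(sip: SIP, identifier: str) -> AIP:
    """Turn a SIP into an AIP, recording the producer and a checksum."""
    metadata = dict(sip.descriptive_metadata, producer=sip.producer)
    checksum = hashlib.sha256(sip.content).hexdigest()
    return AIP(identifier, sip.content, metadata, checksum)


def access(aip: AIP) -> DIP:
    """Derive a DIP from an AIP after checking that the content is unchanged."""
    if hashlib.sha256(aip.content).hexdigest() != aip.sha256:
        raise ValueError(f"fixity check failed for {aip.identifier}")
    title = aip.descriptive_metadata.get("title", "untitled")
    return DIP(aip.identifier, aip.content, title)


if __name__ == "__main__":
    sip = SIP("a scientist", b"raw measurements", {"title": "Field data"})
    print(access(ingest(sip, "archive-0001")))
```

Keeping the three package types separate makes it explicit that what is stored need not be byte-identical in form to what was submitted or what is delivered.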
Relevant Aspects Technical Issues / Obsolescence Identification & Validation of Formats Preservation Metadata  Preservation Policy Legal Aspects Trusted Repositories
Technical Issues / Obsolescence Digital information is stored as a bit stream on physical media => Preservation of the bit stream! Storage media types change quickly and are subject to obsolescence  Storage media are unstable and can degrade quickly Keeping the bit stream accessible Migration (Medium and Format) Emulation (Hard- and Software) ...
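Preserving the bit stream across a medium migration can be checked mechanically: compute a checksum before copying and compare it with the checksum of the copy. The following is a minimal sketch of that idea; the function names and the choice of SHA-256 are assumptions for illustration, and a real archive would also store the checksum and re-check it periodically.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source: Path, target: Path) -> str:
    """Copy a file to a new location and verify the bit stream is identical.

    Returns the verified checksum; raises OSError if the copy differs.
    """
    before = sha256_of(source)
    shutil.copyfile(source, target)
    after = sha256_of(target)
    if before != after:
        raise OSError(f"copy of {source} to {target} is not bit-identical")
    return after
```

A matching checksum only shows that the bits survived; it says nothing about whether the format can still be rendered, which is where format migration and emulation come in.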
Formats: Identification & Validation Examples: Document - DOC, HTML Raster Images - TIFF, PNG, JPEG Structured graphics - CAD, VSD Audio - WAV, MP3, MIDI Video - MPEG, AVI Databases - DBF, MDB Raw data Collections - tar, zip … We are dealing with lots of different formats! File format registries may help to handle the heterogeneity.
File Format Registries: Use Cases Identification I have a digital object; what format is it? Validation I have an object purportedly of format  F ; is it? Transformation I have an object of format  F , but need  G ; how can I produce it? Characterization I have an object of format  F ; what are its features? Risk assessment I have an object of format  F ; is it at risk of obsolescence? Delivery I have an object of format  F ; how can I render it? (Abrams, Seaman: Towards a global digital format registry. IFLA 2003)
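The identification use case ("I have a digital object; what format is it?") is often approached by matching the first bytes of a file against known signatures. The sketch below shows the idea with a tiny hand-written signature table; the table and the `identify` function are illustrative assumptions, not the contents or interface of any actual format registry.

```python
from typing import Optional

# A few well-known leading byte signatures ("magic numbers").
# A real registry would hold far more entries and richer matching rules.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"II*\x00", "TIFF"),   # little-endian TIFF
    (b"MM\x00*", "TIFF"),   # big-endian TIFF
    (b"PK\x03\x04", "ZIP"),
]


def identify(data: bytes) -> Optional[str]:
    """Guess a format name from the leading bytes, or None if unknown."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name
    return None


if __name__ == "__main__":
    print(identify(b"%PDF-1.4\n..."))   # a PDF header
    print(identify(b"plain text"))      # no known signature
```

This covers identification only. Validation, the second use case, requires parsing the whole object against the format specification, since a file can carry a correct signature and still be malformed.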
Format validation with JHOVE JSTOR/Harvard Object Validation Environment see: http://hul.harvard.edu/jhove/ The concept of representation format, or type, permeates all technical areas of digital repositories. Policy and processing decisions regarding object ingest, storage, access, and preservation are frequently conditioned on a per-format basis. In order to achieve necessary operational efficiencies, repositories need to be able to automate these procedures to the fullest extent possible. How much technical metadata do I need?
Preservation Metadata All Preservation strategies (migration, emulation, etc.) depend on the creation, capture and maintenance of suitable metadata: "Preserving the right metadata is key to preserving digital objects" (ERPANET Briefing Paper, 2003) "It's all about metadata" (Cedars project manager, ca. 2000)
Preservation Metadata Specific preservation metadata are necessary to ensure that information can be accessed in the future, e.g. metadata about: Provenance Structure File Format(s) Technical Environment Rights Much of the necessary metadata can be extracted automatically, e.g. via tools like JHOVE
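As a rough illustration of how automatically extracted and manually supplied metadata can be combined, the sketch below builds one record covering the five areas listed above. It does not use JHOVE or follow PREMIS; the `describe` function, the record layout, and the extra `fixity` entry are assumptions made for this example.

```python
import hashlib
import mimetypes
from pathlib import Path


def describe(path: Path, provenance: str, environment: str, rights: str) -> dict:
    """Build a simple preservation metadata record for one file.

    Provenance, technical environment and rights are supplied by a person;
    structure, format and fixity details are derived from the file itself.
    """
    data = path.read_bytes()
    mime_type, _ = mimetypes.guess_type(path.name)  # guess from the file name only
    return {
        "provenance": provenance,
        "structure": {"file_name": path.name, "size_bytes": len(data)},
        "file_format": {"mime_type": mime_type, "extension": path.suffix.lower()},
        "technical_environment": environment,
        "rights": rights,
        # Not on the slide: a checksum, added here as a hypothetical extra field.
        "fixity": {"sha256": hashlib.sha256(data).hexdigest()},
    }
```

A usage example would be `describe(Path("report.pdf"), "deposited by the author", "any PDF 1.4 viewer", "open access")`. Guessing the format from the file extension is the weakest part of this sketch; a characterization tool inspects the content instead.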
Preservation Policy What do you want to preserve? Why do you want to preserve? How do you want to render an object in the future?  Furthermore ... Documentation Policy for short-term preservation Policy for long-term preservation …
Preservation Policy What kind of digital objects is the repository responsible for? Fixed format texts, images, web resources, complex digital objects, datasets, … What do you want to render in the future?  Keep the original? What is the original? Offer extended functionalities?
Preservation Policy What are the significant properties of the object? Appearance (layout, colour, font size, etc) Behaviour (functionality, interaction, etc) Structure (chapter, section, etc) Content (text, video, audio, etc) Context (cross-references, etc) How do you want to provide access?  Designated User Community Options for the user?
But Policies/Strategies are not enough ... …  we need tools that help choose & perform a strategy  make the strategy possible (emulators, migration tools) maintain the link between originals and conversions enable interoperability and co-operation between different repositories/archives Tools have to be implemented in the archiving system and archiving workflow. Preservation has to come to practice!
Legal Aspects Copyright and other intellectual property rights (IPR) have a substantial impact on digital preservation Preservation of digital materials is dependent on a range of strategies, which has implications for IPR in those materials Consideration may need to be given not only to content but to any associated software Obtaining specific permissions may be very challenging, e.g. for web archiving or digital art
Legal Aspects: Examples What will be covered by legal deposit? How much is served from within the country? Strategy The national publication archive How are roles/responsibilities shared? Web archiving initiatives (e.g. European Web Archive) Development of electronic deposit systems International collaboration Other international repositories Levels of redundancy Access restrictions
..., but Digital preservation is often a legal grey area not yet understood or considered by legislators Lack of legal certainty should not prevent digital preservation actions Take action to manage risks
Trusted Repositories Why trusted repositories? It is very easy to manipulate digital information The users need to trust the accessed information Nobody is able to preserve everything – distributed preservation management Criteria of trusted repositories (e.g. TRAC, nestor) Administrative responsibility Financial sustainability Technical security ...
Thank you very much for your attention! Comments? Questions? Stefan Strathmann Göttingen State and University Library [email_address]
Exercise Which  Digital Preservation  issues are relevant in the context of your Digital Collection? How are they relevant? Data creation? Data management (collection management)? Data storage? Data documentation and description? Data preservation? Data use? Rights management? ... Try to describe a digital preservation Framework for your  institution.
