PRESERVATION Web archiving


  1. 1. PRESERVATION Dr. Essam Obaid
  2. 2. PRESERVATION
     The purpose of preservation is to ensure the continued accessibility of an object over time. Successful preservation requires that
     - the object be accessible to users, and
     - it retain its unique value to users.
     Physical materials suffer damage and decay: the acid present in paper damages its fibers, causing it to become brittle and discolored over time. Such concerns also apply to digital objects: the physical storage media will degrade over time or may become corrupted.
  3. 3. Digital Information
     Digital information is saved in the form of bits (ones and zeroes), which represent values in binary notation. Such information cannot be directly interpreted by the user; it requires the mediation of software capable of translating it into human-readable form. (See fig. 6.1 on page 83.)
     Digital Preservation
     Digital preservation requires the management of objects over time, using techniques that may result in frequent and profound changes to the technical representation of those objects. There is no significant difference between the preservation of web resources and that of any other digital object, and the same techniques can be applied in each case.
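The point that bits need software mediation can be illustrated with a minimal Python sketch. The byte values and variable names are illustrative assumptions, not taken from the slides; the sketch only shows that the same stored bits read differently depending on which decoding the software applies.

```python
# Five stored bytes. Nothing in the bytes themselves says how to read them.
raw = bytes([0x63, 0x61, 0x66, 0xC3, 0xA9])

# The underlying bits, as they would sit on the storage medium.
bits = " ".join(f"{b:08b}" for b in raw)

# Two pieces of "mediating software" (decoders) give two different readings.
as_utf8 = raw.decode("utf-8")       # reads the last two bytes as one character
as_latin1 = raw.decode("latin-1")   # reads every byte as its own character

print(bits)
print(as_utf8)
print(as_latin1)
```

Preserving the bits alone is therefore not enough: the knowledge of how to interpret them has to be preserved as well.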
  4. 4. Long-term Sustainable Methodologies
     Preservation is concerned with preserving the means of access to a digital object.
     - Migration
     - Emulation
  5. 5. Emulation
     Emulation is the process of creating a 'virtual' version of the original environment that was used to access a given file. The virtualized environment is accessed via an emulation application on modern hardware and software. This allows access to the original content to be maintained, without changing that content, through the emulated computer. Emulation attempts to retain the experience, the original form of the data, and to a degree the performance, but does not necessarily retain the original form or performance of the hardware.
  6. 6. How Emulation Works
     1. An access environment contemporary with the digital object (that is, its original environment) is encapsulated into an emulated (copy) environment;
     2. The emulated environment is accessed using a current hardware and software platform; and
     3. By using the current hardware and software platform to access the emulated environment, the emulated environment is used to access the target file.
  7. 7. Migration
     Migration is the process of converting a piece of digital content from its original file format into a new format that can be accessed more easily, without having to maintain contemporary (original-era) software and hardware. The basic premise is that the file format needs to be changed. It might be preferable to store the properties that have been identified as significant across multiple files, or using multiple storage mechanisms (e.g., a file and a database).
  8. 8. How Migration Works
     1. The file is acquired in its original format; and
     2. The file is converted to another format.
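The two migration steps can be sketched in Python. This is a minimal illustration under stated assumptions: CSV and JSON are stand-ins for an "original" and a "new" format, and the helper names (`parse_csv`, `properties`) and sample data are hypothetical. The check at the end reflects the idea that significant properties should survive the format change.

```python
import csv
import io
import json

def parse_csv(csv_text):
    """Read CSV text into a list of records (one dict per row)."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def properties(records):
    """Properties treated as significant here (an assumption): count and field names."""
    fields = sorted({key for record in records for key in record})
    return {"record_count": len(records), "fields": fields}

# Step 1: the file is acquired in its original format (sample data, hypothetical).
original_text = "title,year\nAnnual report,2004\nPolicy paper,2005\n"
original_records = parse_csv(original_text)

# Step 2: the content is converted to another format.
migrated_text = json.dumps(original_records, indent=2)

# Verify that the significant properties survived the migration.
migrated_records = json.loads(migrated_text)
preserved = properties(original_records) == properties(migrated_records)
print(preserved)
```

A real migration would compare whatever properties were judged significant for the format in question, such as the resolution and color depth of an image.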
  9. 9. Renders
     An application that runs with current hardware and software is used to access the digital object.
     • The software itself could either be written internally or procured from another party.
     • It could either be a first-party application, if it is written by the same organization responsible for creating the file format, or a third-party application in all other cases.
  10. 10. PRONOM
      The National Archives (TNA) has been actively collecting, preserving, and making available electronic records for nearly 10 years. TNA's approach to digital preservation is founded on two fundamental activities:
      - Passive preservation, which provides secure storage, and
      - Active preservation, which ensures the continued accessibility of the stored records over time and across changing technologies.
  11. 11. Active Preservation
      Active preservation generates new technical manifestations of objects through processes such as format migration or emulation, to ensure their continued accessibility within changing technological environments.
  13. 13. Technical Registries
      A registry is an information source that provides a common reference point for a particular community of users. By registering the key information concepts, the community can benefit from a shared understanding of what those concepts mean; in effect, it provides a common vocabulary.
      In the case of a technical registry in digital preservation, these concepts relate to the technical dependencies of digital objects. For example, if one object is described as being in JPEG format and another as being in JFIF 1.02 format, how can we tell whether the two formats are the same?
      A file format registry containing standard definitions of each format provides a solution: if everyone describes formats with reference to the registry definition, then all ambiguity is removed. A standard referencing mechanism can be provided if each registry record is also assigned a persistent unique identifier.
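The registry idea can be sketched as a lookup that resolves free-text format descriptions to one record with a persistent identifier. Everything here is a hypothetical illustration: the identifier `example-fmt/1`, the alias table, and the decision to resolve a bare "JPEG" label to JFIF 1.02 are assumptions, not entries from any real registry.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class FormatRecord:
    identifier: str   # persistent unique identifier (hypothetical scheme)
    name: str
    version: str

# One registry record, keyed by its identifier.
RECORDS = {
    "example-fmt/1": FormatRecord("example-fmt/1", "JPEG File Interchange Format", "1.02"),
}

# Labels people actually write, mapped to the registry identifier (assumed policy).
ALIASES = {
    "jpeg": "example-fmt/1",
    "jfif 1.02": "example-fmt/1",
    "jfif1.02": "example-fmt/1",
}

def resolve(label: str) -> Optional[FormatRecord]:
    """Normalize a free-text label and look up its registry record, if any."""
    key = " ".join(label.lower().split())
    identifier = ALIASES.get(key)
    return RECORDS.get(identifier) if identifier else None

def same_format(label_a: str, label_b: str) -> bool:
    """Two descriptions name the same format when they resolve to one identifier."""
    a, b = resolve(label_a), resolve(label_b)
    return a is not None and b is not None and a.identifier == b.identifier

print(same_format("JPEG", "JFIF1.02"))
print(same_format("JPEG", "TIFF"))
```

Comparing identifiers rather than names is what removes the ambiguity: two objects are in the same format exactly when their descriptions point at the same registry record.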
  14. 14. Technical Registries
      Not only file formats benefit from registries; their use can potentially be extended to every element of the representation network, including
      - Character encoding schemes
      - Compression algorithms
      - Software
      - Operating systems
      - Hardware and storage media
      PRONOM, the first such operational registry, was developed by The National Archives (TNA) of the UK in 2003 and is available as a free online resource.
  15. 15. Characterization
      Before any object can be preserved, it must be understood with sufficient technical precision. Specifically, it is necessary to understand the significant properties of the object, which must be preserved over time if it is to be regarded as authentic, and its technical characteristics, which will influence the specific preservation strategies that may be employed. For example, the resolution and color depth of an image are likely to be considered fundamental properties to preserve.
      Characterization comprises three discrete stages:
      - Identification
      - Validation
      - Property Extraction
  16. 16. Identification: Identification is typically performed using some form of signature, a digital 'fingerprint' that is unique to a specific format. The simplest signature is provided by a file extension. DROID (Digital Record Object Identification), software developed by TNA, is an example of an identification tool that uses both internal and external signatures to perform automated batch identification of file formats.
      Validation: This determines whether the object is well formed and valid against its formal specification.
      Property Extraction: This extracts the properties of the object that are significant to its long-term preservation.
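Signature-based identification can be sketched in Python as follows. This is not DROID's algorithm or signature file; it is a minimal illustration that treats the leading "magic" bytes of a file as an internal signature and the file extension as an external signature. The table contents cover only a few well-known formats, and the function name `identify` is hypothetical.

```python
import os

# Internal signatures: byte sequences found at the start of the file content.
INTERNAL_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"%PDF-", "PDF"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
]

# External signatures: the file extension, the simplest and least reliable hint.
EXTERNAL_SIGNATURES = {
    ".png": "PNG", ".pdf": "PDF", ".gif": "GIF", ".jpg": "JPEG", ".jpeg": "JPEG",
}

def identify(filename: str, content: bytes) -> dict:
    """Identify a format, preferring the internal signature over the extension."""
    internal = next(
        (fmt for magic, fmt in INTERNAL_SIGNATURES if content.startswith(magic)),
        None,
    )
    extension = os.path.splitext(filename)[1].lower()
    external = EXTERNAL_SIGNATURES.get(extension)
    mismatch = internal is not None and external is not None and internal != external
    return {"format": internal or external, "internal": internal,
            "external": external, "mismatch": mismatch}

# Batch identification over in-memory samples (hypothetical files).
samples = {
    "scan.pdf": b"%PDF-1.4\n...",
    "photo.png": b"\xff\xd8\xff\xe0 JPEG data saved with the wrong extension",
}
for name, data in samples.items():
    print(name, identify(name, data))
```

Reporting a mismatch between the two signatures is useful in practice, because a wrong extension is a common sign that a file needs closer inspection before validation.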
  17. 17. Preservation Planning
      Preservation planning forms the decision-making component of active preservation. Its role is to identify and monitor technological changes and their potential impacts on stored digital objects, and to develop the detailed preservation plans needed to mitigate those impacts.
  18. 18. Preservation Action
      Preservation action represents the enactment of the preservation plan in accordance with the chosen preservation strategy. This will entail either the migration of objects to new formats or the development of emulated environments. Whatever preservation plan is adopted, preservation action requires the availability of specialized software tools.
  19. 19. Passive Preservation
      Passive preservation is concerned with the secure storage of digital objects, and the prevention of accidental or unauthorized damage or loss. As such, passive preservation needs to encompass the following functions (Brown, A. 2006):
      a. Security and access control
      b. Integrity
      c. Storage management
      d. Content management
      e. Disaster recovery
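The integrity function listed above is commonly supported by fixity checks: a checksum is recorded when an object enters storage and recomputed later to detect accidental or unauthorized change. The following Python sketch is one such approach, not a method the slides prescribe; the manifest layout and the names `fixity` and `verify` are assumptions.

```python
import hashlib

def fixity(content: bytes) -> str:
    """Compute a SHA-256 checksum of the stored bytes."""
    return hashlib.sha256(content).hexdigest()

def verify(content: bytes, recorded_checksum: str) -> bool:
    """Return True when the bytes still match the checksum recorded at ingest."""
    return fixity(content) == recorded_checksum

# At ingest: record the checksum in a manifest (hypothetical object ID).
stored = b"archived record"
manifest = {"record-001": fixity(stored)}

# Later audits: an unchanged copy passes, a copy with one altered byte fails.
intact = verify(stored, manifest["record-001"])
damaged = verify(stored[:-1] + b"X", manifest["record-001"])
print(intact, damaged)
```

A checksum only detects damage. Repairing it relies on the other functions in the list, such as storage management and disaster recovery, to supply a good copy.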
  20. 20. Tools for Passive Preservation
      With journal prices still out of control, especially in the science, technical and medical (STM) sector, more and more authors and universities want to take an active part in the publishing and preservation process themselves.
      In picking a tool, a library has to consider a number of questions:
      • What material should be stored in the repository?
      • Is long-term preservation an issue?
      • Which software should be chosen?
      • What is the cost of setting the system up?
      • How much know-how is required?
  21. 21. What is the LOCKSS Program?
      LOCKSS (Lots of Copies Keep Stuff Safe), based at Stanford University Libraries, is an international community initiative that provides libraries with digital preservation tools and support so that they can easily and inexpensively collect and preserve their own copies of authorized e-content. LOCKSS, in its eleventh year, provides libraries with the open-source software and support to preserve today's web-published materials for tomorrow's readers while building their own collections and acquiring a copy of the assets they pay for, instead of simply leasing them. LOCKSS provides 100% post-cancellation access.
  22. 22. EPrints
      EPrints is a tool used to manage the archiving of research in the form of books, posters, or conference papers. Its purpose is not to provide a long-term archiving solution that ensures that material will be readable and accessible through technology changes, but instead to give institutions a means to collect, store and provide Web access to material.
      Currently, there are over 140 repositories worldwide that run the EPrints software. For example, at the University of Queensland in Australia, EPrints is used as a deposit collection of papers that showcases the research output of UQ academic staff and postgraduate students across a range of subjects and disciplines, both before and after peer-reviewed publication.
      EPrints is a free open-source package that was developed at the University of Southampton in the UK.
  23. 23. DSpace
      The DSpace open-source software has been developed by the Massachusetts Institute of Technology Libraries and Hewlett-Packard. At the time of writing, the current version of DSpace is 1.2.1. According to the DSpace Web site, the software allows institutions to
      - capture and describe digital works using a custom workflow process,
      - distribute an institution's digital works over the Web, so users can search and retrieve items in the collection, and
      - preserve digital works over the long term.
  24. 24. Future Trends
      International Standards
      With the rapid development of the information and communication environment, numerous intellectual works are available in digital format on the Internet, and those digital resources tend to disappear soon after they appear. Digital archiving is the long-term procedure to process, manage and preserve those digital objects that are considered to have timeless value. Since the 1990s, many countries, such as Australia, the United States, and European nations, have advanced online preservation efforts for digital resources as long-term national projects, led by their national libraries in cooperation with other institutions and organizations.
  25. 25. OASIS
      With the changing status of libraries in the digital information era, the National Library of Korea (NLK) has planned an efficient national information service that collects quality online digital information, provides public service, and preserves those intellectual records for the generations to come.
      Ahead of the opening of the National Digital Library of Korea in 2008, NLK is working on OASIS (Online Archiving & Searching Internet Sources), a project for online digital resource collection and preservation that collects various web content. The OASIS system was developed in December 2005 to preserve online digital resources for future generations, to collect and preserve national digital cultural heritage, and to establish standard management policies for digital resources.
  26. 26. OASIS (Online Archiving & Searching Internet Sources)
  27. 27. OASIS Approach for Web Resource Collection
      Selective Collection of Web Resources
      NLK's approach to web archiving is basically selective collection. Currently we have two types of objects to collect: web sites and individual web digital resources. They are selectively collected according to an established collection development policy. We will gradually expand the target objects to include video, image, and audio.
      OASIS Collection Target and Collection Policy
      The selection of target resources was based on their utility for current or future information needs, the author's popularity, the uniqueness of the information, academic content, how up to date the information is, the frequency of updating, and accessibility.
  28. 28. OASIS Annual Resource Collection Statistics
      The collection started in 2004, and OASIS currently holds 156,798 resources in total. The collection size is about 2.4 terabytes.

      Table 1. OASIS Resources Collection Statistics (Number of Titles)

      | Type of Resources           | 2004   | 2005   | 2006   | Total   |
      |-----------------------------|--------|--------|--------|---------|
      | Individual Digital Resource | 43,861 | 45,280 | 42,958 | 132,099 |
      | Web Site                    | 1,218  | 2,716  | 20,765 | 24,699  |
      | Total                       | 45,079 | 47,996 | 63,723 | 156,798 |
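The totals in Table 1 can be recomputed from the per-type, per-year counts as a consistency check. The short Python sketch below uses only the six counts given in the table; the variable names are illustrative.

```python
# Per-type, per-year counts copied from Table 1.
counts = {
    "Individual Digital Resource": {2004: 43861, 2005: 45280, 2006: 42958},
    "Web Site": {2004: 1218, 2005: 2716, 2006: 20765},
}

# Row totals: one per resource type.
row_totals = {kind: sum(by_year.values()) for kind, by_year in counts.items()}

# Column totals: one per year.
year_totals = {
    year: sum(by_year[year] for by_year in counts.values())
    for year in (2004, 2005, 2006)
}

grand_total = sum(row_totals.values())

print(row_totals)
print(year_totals)
print(grand_total)
```

The row totals come to 132,099 and 24,699, the yearly totals to 45,079, 47,996 and 63,723, and the grand total to 156,798, which matches the figure stated for the collection as a whole.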
  29. 29. OASIS Workflow and Process
      OASIS workflows and processes are described for web sites and for individual digital resources respectively.
      The process for web sites does not end with a single mirroring cycle, because web sites change their contents continuously. Their resources must be collected at certain intervals in order to preserve them. However, it is impossible for a manager to monitor changes across numerous web sites manually, and collecting every resource unconditionally at a fixed interval (for example, one month, two months, or six months) is considered a waste of resources.
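One way to avoid both manual monitoring and unconditional re-collection is to re-collect a site only when its content has changed. The Python sketch below illustrates that idea with content digests. It is an assumption about how such a check could work, not a description of how OASIS is implemented; the URLs are hypothetical and fetching is simulated with in-memory bytes.

```python
import hashlib

def content_digest(content: bytes) -> str:
    """Summarize page content so two snapshots can be compared cheaply."""
    return hashlib.sha256(content).hexdigest()

def should_recollect(last_digest, current_content: bytes) -> bool:
    """Collect when the site is new or its content differs from the last snapshot."""
    return last_digest is None or content_digest(current_content) != last_digest

def run_cycle(sites: dict, last_digests: dict) -> list:
    """Run one collection cycle; return the URLs that were (re)collected."""
    collected = []
    for url, content in sites.items():
        if should_recollect(last_digests.get(url), content):
            last_digests[url] = content_digest(content)
            collected.append(url)
    return collected

digests = {}
first = run_cycle({"http://a.example": b"v1", "http://b.example": b"v1"}, digests)
second = run_cycle({"http://a.example": b"v1", "http://b.example": b"v2"}, digests)
print(first)
print(second)
```

In the second cycle only the changed site is collected again, so storage grows with actual change rather than with the number of cycles.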
  30. 30. Fig. 1. Workflow for Website Archiving
      The selected individual digital resources are collected by a robot. The robot collects the target resources, checks for duplicates, automatically classifies them according to the classification system, and extracts abstract information. For the processed individual resources, the manager inputs various metadata, then reviews and corrects the records to produce the final catalog for preservation.
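The robot's steps (collect, check for duplicates, classify, extract an abstract) can be sketched as a small Python pipeline. This is a toy illustration under loud assumptions: the keyword-based classifier, the first-sentence abstract, and the category names are all hypothetical stand-ins, since the slides do not say which techniques OASIS uses.

```python
import hashlib
import re

# Hypothetical classification system: category -> indicative keywords.
CATEGORY_KEYWORDS = {
    "science": {"research", "experiment"},
    "culture": {"heritage", "museum"},
}

def classify(text: str) -> str:
    """Pick the category sharing the most keywords with the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {cat: len(words & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unclassified"

def extract_abstract(text: str) -> str:
    """Use the first sentence as a stand-in for automatic abstracting."""
    return re.split(r"(?<=[.!?])\s+", text.strip(), maxsplit=1)[0]

def process(resources: list) -> list:
    """Deduplicate by content digest, then classify and abstract each resource."""
    seen = set()
    catalog = []
    for text in resources:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:          # duplicate check
            continue
        seen.add(digest)
        catalog.append({
            "digest": digest,
            "category": classify(text),
            "abstract": extract_abstract(text),
            "metadata": {},         # left empty for the manager to fill in
        })
    return catalog

docs = [
    "The museum opened a heritage exhibit. More follows.",
    "The museum opened a heritage exhibit. More follows.",
    "A new experiment was reported. Details later.",
]
for entry in process(docs):
    print(entry["category"], "|", entry["abstract"])
```

The empty `metadata` field marks the hand-off point: the automated steps produce a draft record, and the manager completes and corrects it before it enters the final catalog.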
  31. 31. Future Development Direction
      • As knowledge information resources migrate from paper to digital formats, there is an increasing need to collect and preserve digital knowledge information resources at the national level. Recognizing that digital resources are short-lived, NLK runs the OASIS system at the national level to collect and preserve valuable digital resources, so that the current generation can pass them on to the next as digital cultural heritage.
      • To accomplish this mission, the OASIS system provides national standard models for submitting online digital resources to the authority in the future digital environment, and for standardizing collection and preservation systems for online digital resources.
      • Major development technologies are applied to OASIS at the levels of collection, preservation, management, public service, and so on. For the collection process, they include the development of web robot agents and techniques for using them, automatic classification, automatic abstracting, and others.