A presentation reviewing web archiving projects from the end-user perspective, and discussing web archiving in Serbia, presented at the VIII National Conference of the National Center for Digitization, Belgrade, Serbia, April 16, 2009.

Published in: Education, Technology


  1. 1. WEB ARCHIVING PROJECTS END-USER PERSPECTIVE Bogdan Trifunovic, M. A. Digitization Center Public Library Cacak [email_address]
  2. 2. The purpose of the research <ul><li>Examine the usability and accessibility of publicly accessible web archiving projects </li></ul><ul><li>Identify user-friendly features of the websites of several web archiving projects, and create a basic structure and framework for comparative analysis </li></ul><ul><li>Raise awareness of web archiving </li></ul>
  3. 3. INTERNET ARCHIVE <ul><li>Established in 1996 as a non-profit organization (private funding) </li></ul><ul><li>Oldest web archiving project, using the Alexa crawler (robot) to create snapshots of the entire WWW </li></ul><ul><li>The sheer size of the Internet makes it impossible to capture everything online </li></ul>
  4. 4. Newer approaches <ul><li>Mostly dealing with the “national” part of the WWW (e.g. capturing and archiving the national domain, digital preservation of “web heritage”) </li></ul><ul><li>Run by major national institutions (libraries, consortia) </li></ul><ul><li>Selective approach: identifying quality Internet content that satisfies established standards </li></ul>
  5. 5. Web Archiving projects <ul><li>PANDORA (National Library of Australia) </li></ul><ul><li>EUROPEAN ARCHIVE (non-profit) </li></ul><ul><li>MINERVA (Library of Congress) </li></ul><ul><li>UK WEB ARCHIVE (British Library) </li></ul><ul><li>WEBARCHIV (National Library of the Czech Republic) </li></ul><ul><li>*All projects were reviewed in November 2008 </li></ul>
  6. 6. PANDORA <ul><li>PANDAS (PANDORA Digital Archiving System) </li></ul><ul><li>HTTrack crawler </li></ul><ul><li>Excellent documentation, collections easy to navigate and browse </li></ul><ul><li>Basic and advanced search options </li></ul><ul><li>Unlimited access to collections </li></ul>
  7. 7. EUROPEAN ARCHIVE <ul><li>New project, still in development </li></ul><ul><li>Web 2.0 elements (tag cloud, my Desktop) </li></ul><ul><li>Internet Archive harvesting services </li></ul><ul><li>No search options for the web archive </li></ul><ul><li>Multilingual interface </li></ul><ul><li>Unlimited access </li></ul>
  8. 8. MINERVA <ul><li>Harvesting by Internet Archive </li></ul><ul><li>Thematic collections (US elections, war in Iraq, etc.) </li></ul><ul><li>Restrictions on access to some collections (available only from LOC) </li></ul>
  9. 9. UK WEB ARCHIVE <ul><li>Established in 2003 by six institutions as the UK Web Archive Consortium </li></ul><ul><li>Between 2005 and 2007 the project used PANDAS technology </li></ul><ul><li>Since 2008 a new web archiving system based on Web Curator Tool has been in use </li></ul><ul><li>BL has maintained the project since 2008 </li></ul>
  10. 10. WEBARCHIV <ul><li>Heritrix crawler </li></ul><ul><li>Archiving the Czech web domain; public access to a collection of websites (900+) with signed contracts, everything else accessible only from NKP </li></ul><ul><li>No search option except by URL, content not indexed </li></ul>
  11. 11. Why archive the web <ul><li>The general idea is that, given the changing nature of the WWW and the instability of information on the Internet, web content should be preserved in some way, because it is part of national (digital) culture </li></ul><ul><li>Preservation of online documents (e.g., for citation accuracy) </li></ul><ul><li>Huge growth of online material </li></ul>
  12. 12. Difficulties <ul><li>Three important characteristics of the Web make crawling it very difficult: </li></ul><ul><ul><li>its large volume, </li></ul></ul><ul><ul><li>its fast rate of change, and </li></ul></ul><ul><ul><li>dynamic page generation </li></ul></ul><ul><li>Identifying web content that should be preserved for the future – the role of librarians, curators, archivists… </li></ul>
  13. 13. The case of Serbia <ul><li>The transition of the national domain from .yu to .rs started in 2008 </li></ul><ul><li>By October 2009 all .yu content (everything with a .yu address) will permanently disappear from the WWW </li></ul><ul><li>Thousands of web pages will be lost </li></ul><ul><li>There is no strategy for preserving them (and also no time) </li></ul>
  14. 14. Planning on a small scale <ul><li>The Digitization Center of Public Library Cacak created a short list of about 50 websites of interest to us </li></ul><ul><li>We used the HTTrack web crawler to archive them locally </li></ul><ul><li>It is possible to navigate all websites for which the harvesting process was successful </li></ul>
  15. 18. Future steps <ul><li>Improving organizational framework for web archiving of local resources </li></ul><ul><li>Defining the legal setting – how to download and archive authorized material </li></ul><ul><li>Finding solutions for automatic archiving (partially solving the problem of staff shortages) </li></ul>
  16. 19. THANK YOU! QUESTIONS? <ul><li>Bogdan Trifunovic, M. A. Digitization Center Public Library Cacak [email_address] </li></ul>
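
The small-scale approach from slide 14 (a short seed list harvested locally with HTTrack) could be scripted roughly as below. This is only a sketch under stated assumptions, not the procedure the library actually followed: the seed URLs, the `mirrors` output directory, and the helper names `is_yu_host`, `httrack_command` and `plan_harvest` are all hypothetical. The only HTTrack feature assumed is its basic command-line form `httrack <URL> -O <output path>`. The script puts sites under the expiring .yu domain first and only prints the commands; it does not run them.

```python
from typing import List
from urllib.parse import urlsplit

# Hypothetical seed list; the real list of about 50 sites is not given in the slides.
SEEDS = [
    "http://www.example.co.yu/",
    "http://library.example.rs/",
    "http://news.example.org.yu/index.html",
]


def is_yu_host(url: str) -> bool:
    """Return True if the URL's host name ends in the .yu top-level domain."""
    host = urlsplit(url).hostname or ""
    return host.endswith(".yu")


def httrack_command(url: str, mirror_root: str = "mirrors") -> List[str]:
    """Build (but do not run) an HTTrack command that mirrors one site.

    Each site gets its own output directory named after its host,
    which is an assumption made here for illustration.
    """
    host = urlsplit(url).hostname or "unknown-host"
    return ["httrack", url, "-O", f"{mirror_root}/{host}"]


def plan_harvest(seeds: List[str]) -> List[List[str]]:
    """Order seeds so .yu sites come first, then build one command per seed."""
    # sorted() is stable, so the original order is kept within each group.
    ordered = sorted(seeds, key=lambda u: not is_yu_host(u))
    return [httrack_command(u) for u in ordered]


if __name__ == "__main__":
    for cmd in plan_harvest(SEEDS):
        print(" ".join(cmd))
```

Generating the commands separately from executing them leaves room for a curator to review the selection before any harvesting starts, which matches the selective approach described in the slides.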