
WebArchiv - Archive of the Czech Web

Presentation about the Czech web archive.

Published in: Technology, Education

  1. WebArchiv - Archive of the Czech Web
     5. 6. 2014
  2. WebArchiv
       • a digital archive of Czech web resources
       • purposes of web archiving:
           • growth of electronic online resources
           • long-term preservation
           • at-risk content on the web
  3. Department of Web Archiving
  4. History
       • project started in 2000
       • first document harvested on 3. 9. 2001
       • IIPC member since 2007
       • since 2008 part of the National Digital Library
  5. Today
       • 87 TB of archived data
       • whole archive accessible in the library
       • only selective harvests accessible online
       • more than 4000 archived websites with online access
       • 3 people in the department + 1 IT guy
       • focus on long-term preservation
  6. Legal Issues
       • Legal deposit act - doesn't cover online-born documents
       • Copyright act - only the library licence, which allows a library to reproduce a work for its own archiving or conservation purposes
       • Online access - based on contracts with publishers or on a Creative Commons licence
  7. Web Archive Content
       1. Comprehensive harvests
       2. Selective harvests
       3. Topic collections
  8. Comprehensive Harvests
       • contract with the Czech domain provider CZ.NIC
       • once-a-year crawl of the whole .cz domain
       • accessible only in the library
       • a maximum of 5000 harvested files per site
  9. Selective Harvests
       • selective approach:
           • territory
           • language
           • authorship
           • topic/content
       • curated resources
       • crawled periodically (several frequencies)
       • communication with publishers
       • online access
       • cataloging
  10. Topic Collections
       • collections of resources related to a certain event or topic
       • for example:
           • presidential elections
           • floods
           • Olympic Games
  11. Workflow
       • selecting and evaluating
       • contracting with publishers
       • harvesting
       • access and quality assurance
  12. Software
       • crawler: Heritrix
       • access: OpenWayback
       • web curator tool: WA admin
       • https://github.com/WebArchivCZ/
  13. Thank you for your attention.
       Barbora Bjačková, barbora.bjackova@nkp.cz
       Jaroslav Kvasnica, jaroslav.kvasnica@nkp.cz
       http://www.webarchiv.cz
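
Note on slide 8: the limit of 5000 harvested files per site can be illustrated with a minimal Python sketch. This is not the archive's actual crawler configuration, and the slides do not say how the limit is enforced. The class name `SiteBudget` and the choice to treat a URL's host name as the "site" are assumptions made only for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit

MAX_FILES_PER_SITE = 5000  # limit quoted on slide 8


class SiteBudget:
    """Hypothetical helper: counts accepted files per host and enforces a cap."""

    def __init__(self, limit=MAX_FILES_PER_SITE):
        self.limit = limit
        self.counts = Counter()

    def accept(self, url):
        """Return True and count the URL if its host is still under the cap."""
        # Assumption: one host name equals one "site".
        host = (urlsplit(url).hostname or "").lower()
        if not host or self.counts[host] >= self.limit:
            return False
        self.counts[host] += 1
        return True


# Small demonstration with a cap of 2 instead of 5000.
budget = SiteBudget(limit=2)
urls = [
    "http://example.cz/a",
    "http://example.cz/b",
    "http://example.cz/c",  # third file from the same host, rejected
    "http://other.cz/",
]
kept = [u for u in urls if budget.accept(u)]
```

In the demonstration, `kept` holds the first two `example.cz` URLs and the `other.cz` URL. A real crawl would more likely set such a quota in the crawler's own configuration than in a separate script.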
