WebArchiv - Archive of the Czech Web
5. 6. 2014

A presentation about the Czech web archive.

Published in: Technology, Education

  1. WebArchiv - Archive of the Czech Web
     5. 6. 2014
  2. WebArchiv
     • a digital archive of Czech web resources
     • purposes of web archiving:
       • growth of electronic online resources
       • long-term preservation
       • at-risk content on the web
  3. Department of Web Archiving
  4. History
     • project started in 2000
     • first document harvested on 3. 9. 2001
     • IIPC member since 2007
     • since 2008 part of the National Digital Library
  5. Today
     • 87 TB of archived data
     • whole archive accessible in the library
     • only selective harvests accessible online
     • more than 4000 archived websites with online access
     • 3 people in the department + 1 IT guy
     • focus on long-term preservation
  6. Legal Issues
     • Legal deposit act - doesn’t cover online-born documents
     • Copyright act - only the library licence, which allows the library to reproduce a work for its own archiving or conservation purposes
     • Online access - based on contracts with publishers or on a Creative Commons licence
  7. Web Archive Content
     1. Comprehensive harvests
     2. Selective harvests
     3. Topic collections
  8. Comprehensive harvests
     • contract with the Czech domain provider CZ.NIC
     • once-a-year crawl of the whole .cz domain
     • accessible only in the library
     • a maximum of 5000 harvested files per site
  9. Selective Harvests
     • selective approach:
       • territory
       • language
       • authorship
       • topic/content
     • curated resources
     • crawled periodically (several frequencies)
     • communication with publishers
     • online access
     • cataloging
  10. Topic Collections
     • collection of resources related to a certain event or topic
     • for example:
       • presidential elections
       • floods
       • Olympic Games
  11. Workflow
     • selecting and evaluating
     • contracting with publishers
     • harvesting
     • access and quality assurance
  12. Software
     • crawler: Heritrix
     • access: OpenWayback
     • web curator tool: WA admin
     • https://github.com/WebArchivCZ/
  13. Thank you for your attention.
     Barbora Bjačková, barbora.bjackova@nkp.cz
     Jaroslav Kvasnica, jaroslav.kvasnica@nkp.cz
     http://www.webarchiv.cz
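
The per-site limit mentioned for comprehensive harvests (a maximum of 5000 harvested files per site) can be illustrated with a small sketch. This is not WebArchiv's or Heritrix's actual configuration: the `SiteBudget` class and its `allow` method are hypothetical names, and treating a "site" as a URL hostname is an assumption.

```python
from collections import Counter
from urllib.parse import urlsplit

MAX_FILES_PER_SITE = 5000  # limit quoted on the slide


class SiteBudget:
    """Hypothetical per-site counter; a 'site' is assumed to be a hostname."""

    def __init__(self, limit=MAX_FILES_PER_SITE):
        self.limit = limit
        self.fetched = Counter()  # hostname -> files accepted so far

    def allow(self, url):
        """Return True and count the URL if its host is still under the limit."""
        host = (urlsplit(url).hostname or "").lower()
        if self.fetched[host] >= self.limit:
            return False
        self.fetched[host] += 1
        return True


# Usage with a tiny limit so the effect is visible.
budget = SiteBudget(limit=2)
urls = [
    "http://example.cz/a",
    "http://example.cz/b",
    "http://example.cz/c",  # third file from the same host: rejected
    "http://other.cz/",
]
accepted = [u for u in urls if budget.allow(u)]
```

A counter keyed by hostname keeps the check constant-time per URL; a real crawler would apply an equivalent quota inside its own scheduling rules.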
