De Conferentie 2005 Julien Masanes

570 views

Published on

0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
570
On SlideShare
0
From Embeds
0
Number of Embeds
111
Actions
Shares
0
Downloads
2
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

De Conferentie 2005 Julien Masanes

  1. 1. Web Archiving: a collaborative effort in progress<br />Julien Masanès<br />European Archive<br />
  2. 2. Why archiving the web<br />today the web is the main publishing medium (tenth of billions of pages) <br />all aspect of cultural, scientific, mundane production have traces on the web<br />it is a unique source of information on modern societies<br />
  3. 3. ephemeral<br />the half life of a page is less than 2 years<br />even academic publishing on the web is not stable <br />
  4. 4. What has changed?<br />(stable) & (discrete objects)<br />time<br />(publishers)<br />technique<br />online<br />ready for automatic processing<br />
  5. 5. Crawler<br />Archive<br />Web<br />User<br />
  6. 6.
  7. 7.
  8. 8.
  9. 9. www as a grid of servers<br />
  10. 10. Web archives grid (by redirection)<br />
  11. 11. IIPC: standards and tools for the Web archives grid<br />Standards<br />Architecture<br />Storage format (WARC)<br />Metadata<br />
  12. 12. HERITRIX<br />BAT<br />NUTCH-WAX<br />WERA<br />Focus crawler<br />Curator tool<br />
  13. 13. The Internet Archive<br />“Universal Access to Human Knowledge”<br />Digital repository accessible on the Internet<br />Started in 1996 to archive web pages<br />Best known for Wayback Machine<br />Expanded to include movies, books, concerts, and other audio<br />
  14. 14. Web Archive<br />Contents: <br />Snapshot every 2 months since 1996<br />Currently collect 4 billion pages per crawl (50TB/mo)<br />Collecting done by Alexa Internet, web cataloguing company<br />Storage, access and preservation done by Internet Archive<br />
  15. 15. The European digital Archive<br />Was incorporated in 2004 as a non-profit foundation in Amsterdam with public and private support<br />Technological and collection peering agreement with the Internet Archive <br />
  16. 16. EA’s 200 Tb data center in Amsterdam<br />
  17. 17. Our role<br />Open archive for the public<br />Technology partner for cultural institutions wishing to do web collections<br />Focus and domain Crawl<br />Access via online interface and search<br />Quality assurance and reporting on collections<br />Hosting and delivery of content<br />Preservation and backup<br />
  18. 18. Current or recent Web projects<br />EU referendum<br />British elections with British Library<br />German election with DDB<br />Pilot study on archiving of TV and Radio website with the Netherlands Audiovisual Archive (BeelendGeleid)<br />
  19. 19. IIPC: http://netpreserve.org<br />European Archive: 
http://europarchive.org (end of 2005)<br />Web Archive information list:
http://listes.cru.fr/sympa/info/web-archive<br />International Web Archiving Workshop (IWAW): http://iwaw.net<br />julien@europarchive.org<br />

×