
F1 hadar miller__israeli_internet_archive-nli

Nov. 22, 2013

  1. “ArchioNet” Israeli Internet Domain Archive
  2. Agenda
     o NLI Digital Library Infrastructure
     o “ArchioNet” Project Scope
     o Technical Issues
     o The Project in Numbers
     o Legislation
     o What’s Next
  3. NLI Digital Library Infrastructure
  4. “ArchioNet” Project Scope
     • Why do we need this project?
     • What do we harvest?
       o Phase A: .IL websites
       o Phase B: Hebrew-character sites
     • How do we enable accessibility?
       o Phase A: “Wayback Machine” access within NLI only, “ArchioNet” only
       o Phase B: over the web, with cross-reference discovery
     • When did we start?
       o Phase A: 2 full crawls annually, started September 2013
       o Phase B: an additional 4 subject-based crawls annually
     • Where do we execute the harvest?
       o Phase A: NLI with the Internet Archive
       o Phase B: NLI infrastructure
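
The two harvest phases on this slide amount to a scoping rule. The sketch below is only an illustration of that rule, not the project's actual crawler configuration: the function names are hypothetical, and it assumes "Hebrew-character sites" means hostnames that contain Hebrew letters (it could equally mean sites with Hebrew content, which would need page inspection instead). Punycode-encoded (`xn--`) hostnames are not decoded here.

```python
from typing import Optional
from urllib.parse import urlsplit


def has_hebrew(text: str) -> bool:
    """True if any character falls in the Unicode Hebrew block (U+0590..U+05FF)."""
    return any("\u0590" <= ch <= "\u05FF" for ch in text)


def crawl_phase(url: str) -> Optional[str]:
    """Hypothetical scope rule: 'A' for .il hosts, 'B' for Hebrew hostnames, else None."""
    host = urlsplit(url).hostname or ""
    if host == "il" or host.endswith(".il"):
        return "A"  # Phase A: the .IL domain
    if has_hebrew(host):
        return "B"  # Phase B (assumed meaning): Hebrew letters in the hostname
    return None     # outside both phases


if __name__ == "__main__":
    for candidate in ("http://www.nli.org.il/", "http://דוגמה.com/", "http://example.com/"):
        print(candidate, "->", crawl_phase(candidate))
```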
  5. Technical Issues
     • Which crawler (and version) to use?
     • Cataloguing and search tool
     • What to harvest?
     • Seeds are needed
     • Depth of a site
     • Robots.txt
     • The Deep Web
     • How to store and preserve a WARC file
     • Virus detection
     • System architecture
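
Two of the issues listed on this slide, site depth and robots.txt, can be illustrated as a single fetch decision. This is a minimal sketch using Python's standard `urllib.robotparser`; the slides do not say whether ArchioNet honours robots.txt or which depth limit it uses, so the user agent name, the sample rules, and `max_depth` are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string so no network access is needed.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def should_fetch(url: str, depth: int, robots_txt: str,
                 user_agent: str = "example-archive-bot", max_depth: int = 3) -> bool:
    """Return True if the URL is within the depth limit and allowed by robots.txt."""
    if depth > max_depth:
        return False  # too many links away from the seed
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(should_fetch("http://example.org.il/index.html", 1, SAMPLE_ROBOTS_TXT))
    print(should_fetch("http://example.org.il/private/a.html", 1, SAMPLE_ROBOTS_TXT))
    print(should_fetch("http://example.org.il/index.html", 5, SAMPLE_ROBOTS_TXT))
```

An archive may decide to ignore robots.txt for preservation purposes; in that case only the depth check would remain.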
  6. The Project in Numbers
     • ~220K websites
     • ~0.5 GB per site
     • ~100 TB per harvest
     • Average page lifetime: ~100 days
     • 2 full harvests annually
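
The figures on this slide can be cross-checked with simple arithmetic. The sketch below uses only the slide's rounded numbers and assumes decimal units (1 TB = 1000 GB) and no compression or deduplication, neither of which the slides specify. It also compares the gap between two evenly spaced harvests per year with the stated average page lifetime.

```python
SITES = 220_000                  # ~220K websites (from the slide)
GB_PER_SITE = 0.5                # ~0.5 GB per site (from the slide)
HARVESTS_PER_YEAR = 2            # 2 full harvests annually (from the slide)
AVG_PAGE_LIFETIME_DAYS = 100     # ~100 days (from the slide)

# Assumption: decimal units, raw size with no compression or deduplication.
harvest_tb = SITES * GB_PER_SITE / 1000
annual_tb = harvest_tb * HARVESTS_PER_YEAR

# Assumption: the harvests are evenly spaced through the year.
harvest_interval_days = 365 / HARVESTS_PER_YEAR

if __name__ == "__main__":
    print(f"Per harvest: {harvest_tb:.0f} TB")
    print(f"Per year:    {annual_tb:.0f} TB")
    print(f"Interval between harvests: {harvest_interval_days} days "
          f"(average page lifetime: {AVG_PAGE_LIFETIME_DAYS} days)")
```

The product comes to 110 TB per harvest, consistent with the slide's rounded "~100 TB". Under the even-spacing assumption the interval between full harvests (182.5 days) is longer than the average page lifetime, so some pages could appear and disappear between two full crawls.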
  7. Legislation
     • Can NLI harvest?
     • Where is it accessible?
     • Intellectual property
     • What can/should we block?
  8. Thank You 
  9. Back