F1 hadar miller__israeli_internet_archive-nli

“ArchioNet”
Israeli Internet Domain Archive

Agenda

o NLI Digital Library Infrastructure
o “ArchioNet” Project Scope
o Technical Issues
o The Project in Numbers
o Legislation
o What’s Next

NLI Digital Library Infrastructure

“ArchioNet” Project Scope

• Why do we need this project?
• What do we harvest?
  o Phase A: .IL websites
  o Phase B: Hebrew-character sites
• When did we start?
  o Phase A: 2 full crawls annually, started September 2013
  o Phase B: 4 additional subject-based crawls annually
• How to enable accessibility?
  o Phase A: “Wayback Machine” within NLI only, “ArchioNet” only
  o Phase B: Over the web, with cross-reference discovery
• Where to execute the harvest?
  o Phase A: NLI with the Internet Archive
  o Phase B: NLI infrastructure

Technical Issues

• Which crawler (and version) to use?
• Cataloguing and search tool
• What to harvest?
  o Seeds are needed
  o Depth of a site
  o Robots.txt
  o The deep web
• How to store and preserve a WARC file
• Virus detection
• System architecture
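
As an illustration of how the "What to harvest?" items (seeds, depth, robots.txt) might combine into a single scope check, here is a minimal Python sketch. It is not the project's crawler configuration: the user agent, the depth limit, the seed host, and the choice to honor robots.txt are all assumptions made for the example, and the slides list these as open issues rather than decisions.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical values for illustration; none of them come from the slides.
USER_AGENT = "example-archive-bot"
MAX_DEPTH = 3


def path_depth(url: str) -> int:
    """Count non-empty path segments, e.g. /a/b/c.html has depth 3."""
    return len([seg for seg in urlsplit(url).path.split("/") if seg])


def in_scope(url: str, seed_hosts: set[str], robots: RobotFileParser) -> bool:
    """Apply three checks in order: seed host, depth limit, robots.txt."""
    host = (urlsplit(url).hostname or "").lower()
    if host not in seed_hosts:
        return False
    if path_depth(url) > MAX_DEPTH:
        return False
    return robots.can_fetch(USER_AGENT, url)


# The robots.txt rules are parsed from text so the sketch runs offline.
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Hypothetical seed list with a single .il host.
seeds = {"www.example.org.il"}

for candidate in [
    "http://www.example.org.il/about/index.html",
    "http://www.example.org.il/private/x.html",
    "http://www.example.org.il/a/b/c/d/e.html",
    "http://other.example.com/",
]:
    print(candidate, in_scope(candidate, seeds, robots))
```

A national archive may decide to ignore robots.txt for some or all sites. In that case the last check would be dropped or made configurable.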

The Project in Numbers

• ~220K websites
• ~0.5 GB per site
• ~100 TB per harvest
• Average page lifetime: ~100 days
• 2 full harvests annually
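
The figures above can be cross-checked with a short back-of-the-envelope calculation. The sketch below uses only the numbers on this slide. It assumes decimal units (1 TB = 1000 GB) and ignores compression, deduplication, and growth, so the results are rough estimates rather than the project's own figures.

```python
# Figures taken from the slide (all approximate).
SITES = 220_000
GB_PER_SITE = 0.5
FULL_HARVESTS_PER_YEAR = 2
AVG_PAGE_LIFETIME_DAYS = 100

# Assumption: decimal units, 1 TB = 1000 GB.
GB_PER_TB = 1000

# Storage per harvest, consistent with the slide's "~100 TB per harvest".
tb_per_harvest = SITES * GB_PER_SITE / GB_PER_TB

# Storage per year if every full harvest is kept in full.
tb_per_year = tb_per_harvest * FULL_HARVESTS_PER_YEAR

# Gap between harvests, for comparison with the average page lifetime.
days_between_harvests = 365 / FULL_HARVESTS_PER_YEAR

print(f"Per harvest: ~{tb_per_harvest:.0f} TB")
print(f"Per year:    ~{tb_per_year:.0f} TB")
print(f"Days between full harvests: {days_between_harvests}")
print(f"Average page lifetime:      {AVG_PAGE_LIFETIME_DAYS} days")
```

With two full harvests a year, the interval between harvests is longer than the average page lifetime, so some pages can appear and disappear without ever being captured. The additional subject-based crawls planned for Phase B would narrow that gap for selected topics.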

Legislation

• Can NLI harvest?
• Where is it accessible?
• Intellectual property
• What can/should we block?
