Hadar Miller, National Library of Israel:
“ArchioNet” Israel Internet Domain Archive
pptx file of the presentation at the
EVA/Minerva Jerusalem International Conference on Digitisation of Culture,
Jerusalem, The Jerusalem Van Leer Institute, 12-13 November 2013
http://www.digital-heritage.org.il
Presentations available at: http://2013.minervaisrael.org.il
4. “ArchioNet” Project scope
• Why do we need this project?
• What do we harvest?
  • Phase A: .IL websites
  • Phase B: Hebrew-character sites
• When did we start?
  • Phase A: 2 full crawls annually, started September 2013
  • Phase B: an additional 4 subject-based crawls annually
• How to enable accessibility:
  • Phase A: “Wayback Machine” in NLI only, “ArchioNet” only
  • Phase B: over the web, cross-reference discovery
• Where to execute the harvest?
  • Phase A: NLI with Internet Archive
  • Phase B: NLI infrastructure
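The harvest scope on this slide (.IL sites in Phase A, Hebrew-character sites in Phase B) could be expressed as a simple scope test. The following is a minimal sketch, not the project's actual rule set. The function names, the use of the Hebrew Unicode block (U+0590 to U+05FF) as the Phase B criterion, and the 0.3 ratio threshold are all assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Hebrew Unicode block; treating it as the Phase B test is an assumption.
HEBREW_START, HEBREW_END = 0x0590, 0x05FF


def is_il_domain(url: str) -> bool:
    """Phase A rule: the host sits under the .il top-level domain."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    return host == "il" or host.endswith(".il")


def hebrew_ratio(text: str) -> float:
    """Share of alphabetic characters that fall in the Hebrew block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hebrew = sum(1 for ch in letters if HEBREW_START <= ord(ch) <= HEBREW_END)
    return hebrew / len(letters)


def in_scope(url: str, page_text: str = "", phase: str = "A",
             min_ratio: float = 0.3) -> bool:
    """Return True if a page belongs to the harvest scope for the given phase.

    min_ratio is a hypothetical threshold, not a figure from the slides.
    """
    if is_il_domain(url):
        return True
    return phase == "B" and hebrew_ratio(page_text) >= min_ratio
```

Under these assumptions, `in_scope("http://www.digital-heritage.org.il")` is in scope in both phases, while a Hebrew-language page on a non-.il host qualifies only when `phase="B"`.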
5. Technical Issues
• Which crawler (version) to use?
• Cataloguing and search tool
• What to harvest?
• Seeds are needed
• Depth of a site
• Robots.txt
• The Deep Web
• How to store and preserve a WARC file
• Virus detection
• System architecture
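Three of the issues listed above (seeds, depth of a site, and robots.txt) interact in how a crawl frontier is built. The sketch below shows one possible way to combine them: a breadth-first walk from a seed list, cut off at a maximum depth, that skips URLs disallowed by robots.txt. It is an illustration only and not the crawler used by the project. The host `example.org.il`, the user-agent name, and the in-memory `links` map (standing in for real page fetches) are hypothetical. `RobotFileParser` is from Python's standard library.

```python
from collections import deque
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "archionet-sketch"  # hypothetical user-agent name


def build_robots(lines):
    """Parse robots.txt content that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


def plan_crawl(seeds, links, robots_by_host, max_depth):
    """Breadth-first walk from the seeds, bounded by depth and robots.txt.

    links maps a URL to the URLs it points to; it stands in for fetching
    and parsing real pages.
    """
    queue = deque((url, 0) for url in seeds)
    seen = set(seeds)
    allowed = []
    while queue:
        url, depth = queue.popleft()
        robots = robots_by_host.get(urlsplit(url).hostname)
        if robots is not None and not robots.can_fetch(USER_AGENT, url):
            continue  # disallowed: skip the page and do not follow its links
        allowed.append(url)
        if depth == max_depth:
            continue  # depth limit reached: keep the page, stop expanding
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return allowed


robots = build_robots(["User-agent: *", "Disallow: /private/"])
links = {
    "http://example.org.il/": ["http://example.org.il/a",
                               "http://example.org.il/private/x"],
    "http://example.org.il/a": ["http://example.org.il/a/b"],
    "http://example.org.il/a/b": ["http://example.org.il/a/b/c"],
}
plan = plan_crawl(["http://example.org.il/"], links,
                  {"example.org.il": robots}, max_depth=2)
```

With `max_depth=2`, the plan keeps the seed and the two pages below it, drops the `/private/` URL because of the robots.txt rule, and never reaches the page at depth 3. Deep Web content, which is not reachable by following links, falls outside what a link-following sketch like this can discover.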
6. The Project in Numbers
• ~220K websites
• 0.5 GB per site
• ~100 TB per harvest
• Average page lifetime: ~100 days
• 2 full harvests annually
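The figures above can be cross-checked with simple arithmetic: about 220K sites at 0.5 GB each comes to roughly 110 TB, in line with the quoted ~100 TB per harvest. The short sketch below reproduces that estimate. It assumes decimal units (1 TB = 1000 GB) and ignores compression and deduplication, neither of which the slides mention. The yearly total is a derived estimate, not a number from the presentation.

```python
SITES = 220_000              # ~220K sites (from the slide)
GB_PER_SITE = 0.5            # 0.5 GB per site (from the slide)
FULL_HARVESTS_PER_YEAR = 2   # 2 full harvests annually (from the slide)
GB_PER_TB = 1000             # assumption: decimal units


def harvest_size_tb(sites=SITES, gb_per_site=GB_PER_SITE):
    """Estimated raw size of one full harvest, in terabytes."""
    return sites * gb_per_site / GB_PER_TB


def yearly_volume_tb(harvests=FULL_HARVESTS_PER_YEAR):
    """Estimated raw volume of all full harvests in one year, in terabytes."""
    return harvest_size_tb() * harvests


print(f"one harvest: ~{harvest_size_tb():.0f} TB")
print(f"per year:    ~{yearly_volume_tb():.0f} TB")
```

Under these assumptions, two full harvests a year add up to roughly 220 TB of raw data before any Phase B subject-based crawls are counted.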
7. Legislation
• Can NLI harvest?
• Where is it accessible?
• Intellectual property
• What can/should we block?