Hadar Miller, National Library of Israel:
“ArchioNet” Israel Internet Domain Archive
pdf file of the presentation at the
EVA/Minerva Jerusalem International Conference on Digitisation of Culture,
Jerusalem, The Jerusalem Van Leer Institute, 12-13 November 2013
http://www.digital-heritage.org.il
Presentations available at: http://2013.minervaisrael.org.il
4.
“ArchioNet” Project scope
• Why do we need this project?
• What do we harvest?
    • Phase A: .IL websites
    • Phase B: Hebrew-character sites
• When did we start?
    • Phase A: 2 full crawls annually, started September 2013
    • Phase B: an additional 4 subject-based crawls annually
• How to enable accessibility:
    • Phase A: “Wayback Machine” in NLI only, “ArchioNet” only
    • Phase B: over the Web, cross-reference discovery
• Where to execute the harvest?
    • Phase A: NLI with Internet Archive
    • Phase B: NLI infrastructure
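The harvest scope above (Phase A: .IL sites, Phase B: Hebrew-character sites) can be sketched as a simple filter. This is only an illustration, not the project's actual selection logic: the function names are hypothetical, and it assumes "Hebrew-character sites" means pages whose text contains characters from the Unicode Hebrew block (U+0590 to U+05FF), which the slides do not specify.

```python
from urllib.parse import urlsplit


def is_il_domain(url: str) -> bool:
    """Phase A (assumed reading): the host is under the .il top-level domain."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    return host == "il" or host.endswith(".il")


def has_hebrew(text: str) -> bool:
    """True if any character falls in the Unicode Hebrew block U+0590..U+05FF."""
    return any("\u0590" <= ch <= "\u05ff" for ch in text)


def in_scope(url: str, page_text: str = "", phase: str = "A") -> bool:
    """Hypothetical scope rule: .il hosts always qualify; in Phase B,
    pages containing Hebrew characters qualify as well."""
    if is_il_domain(url):
        return True
    return phase == "B" and has_hebrew(page_text)
```

Under these assumptions, a .il host is in scope in both phases, while a non-.il page with Hebrew text is in scope only in Phase B.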
5.
Technical Issues
• Which crawler (and version) to use?
• Cataloguing and search tool
• What to harvest?
• Seeds are needed
• Depth of a site
• Robots.txt
• The Deep Web
• How to store and preserve a WARC file
• Virus detection
• System architecture
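Three of the issues above (seeds, depth of a site, robots.txt) interact when planning a crawl. The sketch below is a minimal illustration, not the crawler the project used: it walks a hypothetical in-memory link graph breadth-first from the seeds, stops at a maximum link depth, and skips URLs that a robots.txt file disallows. It assumes all URLs share one host and therefore one robots.txt; the user-agent name and URLs are made up.

```python
from collections import deque
from urllib.robotparser import RobotFileParser


def plan_crawl(seeds, links, robots_txt, user_agent, max_depth):
    """Return URLs in visiting order, honouring robots.txt and a depth limit.

    seeds      -- starting URLs (depth 0)
    links      -- dict mapping a URL to the URLs it links to (stand-in for fetching)
    robots_txt -- robots.txt content for the (single, assumed) host
    max_depth  -- maximum number of link hops from a seed
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    seen, order = set(), []
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        url, depth = queue.popleft()
        if url in seen or depth > max_depth:
            continue
        if not parser.can_fetch(user_agent, url):
            continue  # disallowed by robots.txt
        seen.add(url)
        order.append(url)
        for target in links.get(url, []):
            queue.append((target, depth + 1))
    return order


# Hypothetical example data.
ROBOTS_TXT = "User-agent: *\nDisallow: /private/\n"
LINKS = {
    "http://example.org.il/": [
        "http://example.org.il/a",
        "http://example.org.il/private/x",
    ],
    "http://example.org.il/a": ["http://example.org.il/a/b"],
    "http://example.org.il/a/b": ["http://example.org.il/a/b/c"],
}
plan = plan_crawl(["http://example.org.il/"], LINKS, ROBOTS_TXT, "example-bot", 2)
```

Whether an archive should honour robots.txt at all is a policy question the slides raise but do not answer; the check here is just one possible choice.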
6.
The Project in Numbers
• ~220K websites
• 0.5 GB per site
• ~100 TB per harvest
• Average page lifetime: ~100 days
• 2 full harvests annually
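The figures above are consistent with each other, as a quick back-of-the-envelope check shows. The sketch assumes decimal units (1 TB = 1000 GB) and, for the yearly figure, that the two full harvests are stored without deduplication; neither assumption is stated in the slides.

```python
SITES = 220_000              # ~220K websites
GB_PER_SITE = 0.5            # average size per site
FULL_HARVESTS_PER_YEAR = 2   # Phase A full crawls


def harvest_size_tb(sites: int, gb_per_site: float, gb_per_tb: int = 1000) -> float:
    """Estimated size of one full harvest in terabytes."""
    return sites * gb_per_site / gb_per_tb


per_harvest_tb = harvest_size_tb(SITES, GB_PER_SITE)      # 110.0, i.e. roughly 100 TB
per_year_tb = per_harvest_tb * FULL_HARVESTS_PER_YEAR     # 220.0 if nothing is deduplicated
```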
7.
Legislation
• Can NLI harvest?
• Where is it accessible?
• Intellectual property
• What can/should we block?