This presentation was given by Will Palmer at ‘SCAPE Information Day at the British Library’, on 14 July 2014. The information day introduced the EU-funded project SCAPE (Scalable Preservation Environments) and its tools and services to the participants.
In this presentation Will Palmer introduced Nanite, a tool developed in SCAPE that can help institutions analyze their web archive data.
SCAPE Information Day at BL - Characterising content in web archives with Nanite
1. Characterising content in web archives with Nanite
William Palmer SCAPE Information Day British Library, UK, 14th July 2014
2. Web Archives
•When web sites are harvested they are stored in a container format
•The main web archive container formats are ARC and WARC (an ISO standard)
•They are effectively analogous to a zip file
This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
[Figure: WARC container]
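To make the zip-file analogy concrete, here is a minimal sketch of walking the records in a WARC stream. It assumes an uncompressed stream and builds two simplified records in memory; real WARC files are usually gzip-compressed and carry more mandatory headers (such as WARC-Record-ID and WARC-Date). The helper names `make_record` and `read_warc_records` are hypothetical and are not part of Nanite or warc-hadoop-recordreaders.

```python
import io


def make_record(uri, body):
    """Build one simplified, uncompressed WARC record (illustration only)."""
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    # Each record is terminated by two CRLF pairs.
    return head + body + b"\r\n\r\n"


def read_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if line.strip() == b"":
            continue  # blank lines separating records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The payload is exactly Content-Length bytes, whatever it contains.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


archive = (
    make_record("http://example.org/a.gif", b"GIF89a...")
    + make_record("http://example.org/b.txt", b"hello")
)
records = list(read_warc_records(io.BytesIO(archive)))
for headers, payload in records:
    print(headers["WARC-Target-URI"], len(payload))
```

As with a zip file, each entry carries its own metadata plus an opaque payload, so the container says nothing reliable about what the payload actually is.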
3. Characterisation
•Web archives can hold billions of individual records
•To answer deeper questions you have to determine what data is held
•Not the same as a homogeneous collection of images
•Can contain everything and anything:
•Correctly formed files
•Malformed files
•Viruses
•Unknown files?
•You name it
[Figure: files of unknown type (?) identified as JPG, GIF, TXT and XLS]
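One common way to identify a file without trusting its name is to look at its leading bytes. The sketch below is a toy version of that idea, assuming a handful of well-known signatures; it is not how DROID, Tika or libmagic are implemented, and the `identify` function and its labels are hypothetical.

```python
# Leading-byte signatures for a few formats. The OLE2 signature is shared by
# legacy Office files, so it can only suggest XLS rather than confirm it.
MAGIC_SIGNATURES = [
    (b"\xff\xd8\xff", "JPG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 (e.g. XLS)"),
]


def identify(data):
    """Return a coarse label for a payload based on its content alone."""
    for magic, label in MAGIC_SIGNATURES:
        if data.startswith(magic):
            return label
    if not data or b"\x00" in data:
        return "unknown"  # empty, or contains NUL bytes so unlikely to be text
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown"
    return "TXT"  # decodes cleanly as UTF-8 and has no NUL bytes


for sample in (b"\xff\xd8\xff\xe0JFIF", b"GIF89a\x01\x00", b"plain text", b"\x00\x01\x02"):
    print(identify(sample))
```

Malformed files and formats outside the signature list fall through to "unknown", which is the case the production tools spend most of their effort on.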
4. Nanite
•Nanite is formed of two main modules
•nanite-core: a Java API for the UK National Archives’ DROID
•nanite-hadoop: WARC content characterisation using Hadoop
•Apache Tika (Detector), nanite-core & libmagic-jni (‘file’)
•Optionally use Tika (Parsers); data output to sequence files
•Also lists server content type & file extension
•Reuses: warc-hadoop-recordreaders (partially SCAPE)
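As an illustration of listing the server content type and file extension alongside the detected types, the following sketch assembles one output row per record. It is an assumption about the shape of such a row, not Nanite's actual output format (which the slide describes as Hadoop sequence files); `file_extension` and `summarise` are hypothetical names.

```python
import posixpath
from urllib.parse import urlsplit


def file_extension(uri):
    """Return the lower-cased extension of the URI path, or '' if there is none."""
    path = urlsplit(uri).path  # drops the query string and fragment
    return posixpath.splitext(path)[1].lstrip(".").lower()


def summarise(uri, server_content_type, detections):
    """Build one row; `detections` maps a tool name to the type it reported."""
    row = {
        "uri": uri,
        "server": server_content_type,
        "extension": file_extension(uri),
    }
    row.update(detections)
    return row


row = summarise(
    "http://example.org/img/logo.GIF?size=2",
    "text/html",  # what the server claimed
    {"tika": "image/gif", "droid": "image/gif"},  # what the tools found
)
print(row)
```

Keeping the claimed and detected values side by side makes disagreements between the server, the file name and the content easy to query later.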
5. Speed
•Fast: for 1TB (14k WARCs, 93m files), MIME types detected in 17 hours
•Nanite has also been used at the Danish State and University Library
•7.3TB data, 80k ARC files, 261m files
•Identification using DROID and Tika
•Characterisation using Tika
•…in 32 hours
•Same platform but using FITS (not using Hadoop, but parallelised):
•12TB data, 100k ARC files, 400m files
•An entire year of processing (8760 hours)
[Figure: Map task running Tika Identify, Nanite/DROID, libmagic and Tika Parser]
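The figures on this slide can be turned into rough throughput rates. The sketch below uses only the numbers quoted above and assumes decimal units (1TB = 1000GB); the three runs did different amounts of work per file, so the comparison is indicative rather than like-for-like.

```python
# (label, terabytes, files, hours), taken from the slide.
runs = [
    ("Nanite on Hadoop, MIME type detection", 1.0, 93e6, 17),
    ("Nanite on Hadoop, Danish State and University Library", 7.3, 261e6, 32),
    ("FITS, parallelised without Hadoop", 12.0, 400e6, 8760),
]


def throughput(terabytes, files, hours):
    """Return (files per second, gigabytes per hour) for one run."""
    files_per_second = files / (hours * 3600)
    gigabytes_per_hour = terabytes * 1000 / hours  # assumes 1TB = 1000GB
    return files_per_second, gigabytes_per_hour


for label, terabytes, files, hours in runs:
    fps, gbph = throughput(terabytes, files, hours)
    print(f"{label}: {fps:,.0f} files/s, {gbph:,.1f} GB/h")
```

On these numbers the two Hadoop runs process on the order of a few thousand files per second, against roughly a dozen per second for the FITS run.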
6. Stats
•1370 different MIME types reported by the original servers
•Tika detected 342
•DROID detected 319
•Additional information in this blog post: http://www.openplanetsfoundation.org/blogs/2014-05-28-weekend-nanite
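Counts like these come from tallying the distinct type strings each source reports. The sketch below shows one way to do that over a few invented sample rows; the rows, the `normalise` step and the function names are all hypothetical, and the slide does not say whether the 1370 server-reported types were counted before or after any normalisation.

```python
from collections import defaultdict


def normalise(content_type):
    """Drop parameters such as charset and fold case."""
    return content_type.split(";", 1)[0].strip().lower()


def distinct_types(rows):
    """Map each source name to the set of distinct normalised types it reported."""
    seen = defaultdict(set)
    for row in rows:
        for source, value in row.items():
            seen[source].add(normalise(value))
    return dict(seen)


# Invented sample rows, for illustration only.
rows = [
    {"server": "text/html; charset=UTF-8", "tika": "text/html", "droid": "text/html"},
    {"server": "TEXT/HTML", "tika": "text/html", "droid": "text/html"},
    {"server": "image/jpg", "tika": "image/jpeg", "droid": "image/jpeg"},
    {"server": "image/pjpeg", "tika": "image/jpeg", "droid": "image/jpeg"},
    {"server": "application/octet-stream", "tika": "image/gif", "droid": "image/gif"},
]
counts = {source: len(types) for source, types in distinct_types(rows).items()}
print(counts)
```

Even in this small sample the server column yields more distinct values than the detectors, because servers are free to send non-standard or generic type strings.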