Per Møldrup-Dalum
State and University Library
SCAPE Information Day
State and University Library, Denmark, 2014-06-25

Identification and feature extraction of web archive data based on Nanite, SCAPE Information Day, 25 June 2014

At the SCAPE Information Day at the State and University Library, Denmark, on 25 June 2014, Per Møldrup-Dalum presented how the State and University Library has extracted metadata from the Danish Netarchive. More information about the experiment can be found in the blog post A Weekend With Nanite: http://openplanetsfoundation.org/blogs/2014-05-28-weekend-nanite.
The information day introduced the EU-funded project SCAPE (Scalable Preservation Environments) and its tools and services to the participants. Read more about the event in this blog post, http://bit.ly/SCAPE_SB_Demo.

Published in: Technology, Education

  1. A Weekend with Nanite: Large scale characterisation of web archives. Per Møldrup-Dalum, State and University Library. SCAPE Information Day, State and University Library, Denmark, 2014-06-25.
  2. Agenda
     • A short introduction to the experiment
     • A live demonstration
     • A look at the data for characterisation
     • A look at the input for the job
     • Run the job
     • Analysis of the output and of the run itself
     This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
  3. Task at Hand
     • Performance-testing the tools
     • SCAPE User Story: As a Web Archive I need a Digital Preservation System that can process both ARC and WARC files and identify the file formats of, and characterise, the items they contain, so that I can assess preservation risks and plan which tools will be required for access to those formats.
  4. Tools at Hand
     • Apache Tika
     • DROID from The National Archives
     • (libmagic)
     • Not a word on FITS...
  5. Nanite
     • Created and maintained by the British Library
     • Improved by SCAPE and sustained by Open Planets Foundation
     • Tika and libmagic support added
     • Advanced Tika support through a "persistent" Tika server
     • ARC header extraction added
     • More to come…
  7. References
     • SCAPE User Story for web archive data: http://wiki.opf-labs.org/display/SP/File+Format+Identification+and+Characterisation+of+Web+Archives
     • Nanite: https://github.com/openplanets/nanite
     • A Weekend With Nanite blog post: http://openplanetsfoundation.org/blogs/2014-05-28-weekend-nanite
     • Open Planets Blogs: http://openplanetsfoundation.org/blog
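
The user story on the Task at Hand slide (read WARC records, identify the format of each item, summarise the result) can be illustrated with a small sketch. This is not how Nanite, Tika, DROID or libmagic work internally. It is a standard-library-only Python illustration under several assumptions: the WARC stream is uncompressed, only response records are counted, and the signature table is a tiny hypothetical stand-in for a real signature registry. All function names are invented for this example.

```python
import io
from collections import Counter

# Hypothetical, deliberately tiny table of magic bytes -> MIME type.
# Real tools rely on far larger signature registries.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
]


def identify(payload):
    """Return a MIME type based on the leading bytes of the payload."""
    for magic, mime in SIGNATURES:
        if payload.startswith(magic):
            return mime
    return "application/octet-stream"  # fallback for unknown content


def iter_warc_records(stream):
    """Yield (headers, block) for each record of an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if line.strip() == b"":
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the record header
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()
        # The block is read by length, so binary content is handled safely.
        yield headers, stream.read(int(headers["content-length"]))


def http_body(block):
    """Drop the HTTP response headers that precede the archived payload."""
    head, sep, body = block.partition(b"\r\n\r\n")
    return body if sep else block


def characterise(stream):
    """Count identified formats over all response records in the stream."""
    counts = Counter()
    for headers, block in iter_warc_records(stream):
        if headers.get("warc-type") == "response":
            counts[identify(http_body(block))] += 1
    return counts


def make_record(warc_type, block):
    """Build a minimal WARC record for the demo (most headers omitted)."""
    head = "WARC/1.0\r\nWARC-Type: %s\r\nContent-Length: %d\r\n\r\n" % (
        warc_type, len(block))
    return head.encode("utf-8") + block + b"\r\n\r\n"


def demo():
    """Characterise a small in-memory WARC stream with three responses."""
    http = b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\n"
    data = (
        make_record("warcinfo", b"software: example\r\n")
        + make_record("response", http + b"%PDF-1.4 sample")
        + make_record("response", http + b"GIF89a sample")
        + make_record("response", http + b"hello")
    )
    return characterise(io.BytesIO(data))
```

Calling `demo()` returns a `Counter` of identified formats for the three response records. The identification is based on the payload bytes rather than on the Content-Type header the server sent, which is the point of running an identification tool over an archive in the first place.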
