Evaluation of format identification tools

Johan van der Knijff of the National Library of the Netherlands presented his evaluation of format identification tools. He concluded by discussing potential next steps for tools like DROID, following the results of his evaluation.

This talk was given as part of The Future of File Format Identification: PRONOM and DROID User Consultation, in collaboration with the Digital Preservation Coalition at The National Archives, UK, on 28 November 2011.

Listen to the full presentation at http://media.nationalarchives.gov.uk/index.php/johan-van-der-knijff-evaluation-of-format-identification-tools/


Transcript

  • 1. SCAPE: Evaluation of format identification tools. Johan van der Knijff, Koninklijke Bibliotheek – National Library of the Netherlands, johan.vanderknijff@kb.nl. The Future of File Format Identification, DPC / The National Archives, Kew, London, 28.11.2011
  • 2. Background & context
  • 3. About SCAPE: SCAlable Preservation Environments. EU-funded FP7 project; 16 partners. Scalable services for preservation and preservation planning. Semi-automated workflows for large-scale, heterogeneous collections of complex digital objects.
  • 4. Importance of identification. [Diagram: three preservation actions on an object: automated watch (alert), migration (object in new format), emulation (emulated performance). All depend on the question "What is it?"] Identification!
  • 5. Evaluation of identification tools: Which tools suitable for SCAPE architecture? Specific strengths/weaknesses. Decide on needed enhancements and modifications. Hopefully provide some useful input to developers as well!
  • 6. Tools: DROID 6.0. FIDO 0.93/0.95. Unix File Utility 5.0.3. FITS 0.5 (uses DROID 3.0). JHOVE2 (uses DROID 4.0).
  • 7. Evaluation framework: Total of 22 criteria, broadly covering: usability in automated workflow (interface, dependencies); fit to requirements of archival setting (format coverage, extendibility, accuracy); output (format, identifiers, granularity); user documentation; performance, stability and error handling.
  • 8. Key principles: Hands-on testing using real data is essential!
  • 9. Key principles: Inform tool developers of results, and give them the opportunity to provide feedback.
  • 10. Performance tests
  • 11. Test data: KB Scientific Journals set. N: 11,892. Size (min): 11. Size (median): 4,737. Size (max): 25,495,289. Total size: 1.15 GB.
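  • 12. One file per tool invocation. [Diagram: the tool is started separately for File 1, File 2, ..., File n, and each run produces its own output.]
  • 13. Many files per tool invocation. [Diagram: a single tool run takes File 1, File 2, ..., File n and produces one output.]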
  • 14. One file per invocation
  • 15. One file per invocation
  • 16. One file per invocation
  • 17. Comparison: one file per invocation. [Chart comparing File, FIDO, FITS, JHOVE2, DROID.]
  • 18. Comparison: many files per invocation. [Chart comparing File, DROID, FIDO, JHOVE2.]
  • 19. Performance: main conclusions. All tested Java-based tools slow for one-file-per-invocation use case. Performance much better for many-files-per-invocation use case. Slow initialisation seems to be main culprit: actual processing time per file: milliseconds; tool initialisation time: several seconds!
  • 20. So is this really a problem? Depends on required throughput. Depends on workflow interface (command line or Java API). Depends on organisation of workflow. Depends on purpose (e.g. pre-ingest vs profiling of large file collections).
  • 21. Apples vs oranges. FITS, JHOVE2: wrappers; also feature extraction and validation. DROID 6, FIDO: recurse into ZIP files; File doesn’t!
  • 22. Miscellaneous observations
  • 23. Other observations. Signature-based identification doesn’t work too well for text-based formats (including XML). File outperforms other tools on format coverage and performance; management of signatures (‘magic’ file) awkward. DROID 6 output handling clumsy in automated workflows (separate DROID invocation needed for exporting profile information!).
  • 24. Response to this work so far. FIDO: version 0.9.6 released in October; fixes most reported issues. FITS: version 0.6 released in October; various enhancements based on outcome of evaluation. DROID, JHOVE2: both provided feedback and will consider test results for upcoming releases.
  • 25. Possible next steps. Improve evaluation of accuracy. Keep up with tool updates; keep this work up-to-date. Publish all used scripts and detailed description of analysis methods so others can contribute more easily. Use publicly available test corpus (e.g. Govdocs1).
  • 26. Link to full report on OPF blog: www.openplanetsfoundation.org/blogs/2011-09-21-evaluation-identification-tools-first-results-scape More about SCAPE: http://www.scape-project.eu #SCAPEProject
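
Slide 19 attributes the slowness of the one-file-per-invocation case to tool initialisation rather than to per-file processing. A rough cost model makes the scale of that effect visible. The sketch below is only an illustration: the slides give "several seconds" and "milliseconds" without exact figures, so the 3 s start-up cost and 5 ms per-file cost are assumed values, and `total_seconds` is a hypothetical helper, not part of any of the evaluated tools. Only the file count (11,892) comes from slide 11.

```python
def total_seconds(n_files, init_s, per_file_s, files_per_invocation):
    """Total run time when every tool start-up costs init_s seconds."""
    # Ceiling division: how many times the tool has to be started.
    invocations = -(-n_files // files_per_invocation)
    return invocations * init_s + n_files * per_file_s


N_FILES = 11_892     # size of the KB Scientific Journals set (slide 11)
INIT_S = 3.0         # ASSUMED start-up cost; the slides say only "several seconds"
PER_FILE_S = 0.005   # ASSUMED per-file cost; the slides say only "milliseconds"

# One start-up per file versus a single start-up for the whole set.
one_per_call = total_seconds(N_FILES, INIT_S, PER_FILE_S, files_per_invocation=1)
all_in_one = total_seconds(N_FILES, INIT_S, PER_FILE_S, files_per_invocation=N_FILES)

print(f"one file per invocation:   {one_per_call / 3600:.1f} hours")
print(f"many files per invocation: {all_in_one:.0f} seconds")
```

Under these assumed figures the same set takes about 9.9 hours when the tool is restarted for every file and about 62 seconds in a single run, which is why slide 20 ties the severity of the problem to how the workflow invokes the tool.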
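
Slide 23 notes that signature-based identification does not work too well for text-based formats, including XML. The slides do not say why; one plausible reason is that binary formats usually start with fixed "magic" bytes, while text formats often have no mandatory fixed prefix (the XML declaration, for instance, is optional). The toy matcher below illustrates that point only. Its signature table is a hand-picked assumption for this example and is not taken from PRONOM, DROID, FIDO or the `magic` file; `identify` and the sample file names are hypothetical.

```python
# Toy signature table: well-known leading bytes for a few formats.
# ASSUMPTION: hand-picked for illustration, not the PRONOM registry.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"PK\x03\x04", "ZIP container"),
    (b"<?xml", "XML"),  # naive: only matches files that carry an XML declaration
]


def identify(data: bytes) -> str:
    """Return the first format whose signature matches the start of data."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"


# Hypothetical in-memory samples standing in for real files.
samples = {
    "report.pdf": b"%PDF-1.4\n...",
    "with_declaration.xml": b'<?xml version="1.0"?><article/>',
    "no_declaration.xml": b"<article><title>t</title></article>",
    "notes.txt": b"plain text",
}

results = {name: identify(data) for name, data in samples.items()}
for name, fmt in results.items():
    print(f"{name}: {fmt}")
```

The PDF is recognised from its fixed header, but the XML file without a declaration falls through to "unknown" even though it is well-formed XML, and the plain text file has nothing to match at all.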
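
Slide 11 summarises the test set by file count and by minimum, median, maximum and total size, and slide 25 suggests repeating the work on a public corpus. A summary of that shape could be produced for any directory along the following lines. This is a minimal sketch, not the script used in the evaluation; `size_summary` is a hypothetical helper, and it reports sizes in bytes (the slide itself does not state its units).

```python
import statistics
import tempfile
from pathlib import Path


def size_summary(root):
    """Count the files under root and summarise their sizes in bytes."""
    sizes = [p.stat().st_size for p in Path(root).rglob("*") if p.is_file()]
    if not sizes:
        raise ValueError(f"no files found under {root!r}")
    return {
        "N": len(sizes),
        "min": min(sizes),
        "median": statistics.median(sizes),
        "max": max(sizes),
        "total": sum(sizes),
    }


# Demo on a throwaway directory holding three files of 1, 5 and 10 bytes.
with tempfile.TemporaryDirectory() as tmp:
    for name, n_bytes in [("a.bin", 1), ("b.bin", 5), ("c.bin", 10)]:
        (Path(tmp) / name).write_bytes(b"x" * n_bytes)
    demo = size_summary(tmp)

print(demo)
```

For a real corpus, the temporary directory would be replaced by the corpus root, for example `size_summary("path/to/corpus")` with a path of your own.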