
Evaluation of format identification tools

Johan van der Knijff, The National Library of the Netherlands, presented his evaluation of format identification tools. He concluded by discussing the potential next steps for tools like DROID, following the results of his evaluation.

This talk was given as part of The Future of File Format Identification: PRONOM and DROID User Consultation, in collaboration with the Digital Preservation Coalition at The National Archives, UK, on 28 November 2011.

Listen to the full presentation at http://media.nationalarchives.gov.uk/index.php/johan-van-der-knijff-evaluation-of-format-identification-tools/


  1. SCAPE: Evaluation of format identification tools
     Johan van der Knijff, Koninklijke Bibliotheek – National Library of the Netherlands (johan.vanderknijff@kb.nl)
     The Future of File Format Identification, DPC / The National Archives, Kew, London, 28.11.2011
  2. Background & context
  3. About SCAPE
     - SCAlable Preservation Environments
     - EU funded FP7 project; 16 partners
     - Scalable services for preservation and preservation planning
     - Semi-automated workflows for large-scale, heterogeneous collections of complex digital objects
  4. Importance of identification (diagram)
     - Object -> automated watch -> alert
     - Object -> migration -> object (new format)
     - Object -> emulation -> emulated performance
     - What is it? Identification!
  5. Evaluation of identification tools
     - Which tools suitable for SCAPE architecture?
     - Specific strengths/weaknesses
     - Decide on needed enhancements and modifications
     - Hopefully provide some useful input to developers as well!
  6. Tools
     - DROID 6.0
     - FIDO 0.93/0.95
     - Unix File Utility 5.0.3
     - FITS 0.5 (uses DROID 3.0)
     - JHOVE2 (uses DROID 4.0)
  7. Evaluation framework
     Total of 22 criteria, broadly covering:
     - Usability in automated workflow (interface, dependencies)
     - Fit to requirements archival setting: format coverage, extendibility, accuracy
     - Output: format, identifiers, granularity
     - User documentation
     - Performance, stability and error handling
  8. Key principles
     - Hands-on testing using real data is essential!
  9. Key principles
     - Inform tool developers on results, and give them opportunity to provide feedback
  10. Performance tests
  11. Test data: KB Scientific Journals set
     - N: 11,892
     - size (min): 11
     - size (median): 4,737
     - size (max): 25,495,289
     - Total size: 1.15 GB
  12. One file per tool invocation (diagram)
     - File 1 -> tool -> Output
     - File 2 -> tool -> Output
     - File n -> tool -> Output
  13. Many files per tool invocation (diagram)
     - File 1, File 2, ..., File n -> tool -> Output
  14. One-file per invocation
  15. One-file per invocation
  16. One-file per invocation
  17. Comparison: one-file per invocation (tools shown: File, FIDO, FITS, JHOVE2, DROID)
  18. Comparison: many-files per invocation (tools shown: File, DROID, FIDO, JHOVE2)
  19. Performance: main conclusions
     - All tested Java-based tools slow for one-file-per-invocation use case
     - Performance much better for many-files-per-invocation use case
     - Slow initialisation seems to be main culprit
       - Actual processing time per file: milliseconds
       - Tool initialisation time: several seconds!
  20. So is this really a problem?
     - Depends on required throughput
     - Depends on workflow interface (command line or Java API)
     - Depends on organisation of workflow
     - Depends on purpose (e.g. pre-ingest vs profiling of large file collections)
  21. Apples vs oranges
     - FITS, JHOVE2: wrappers; also feature extraction and validation
     - DROID 6, FIDO: recurse into ZIP files; File doesn’t!
  22. Miscellaneous observations
  23. Other observations
     - Signature-based identification doesn’t work too well for text-based formats (including XML)
     - File outperforms other tools on format coverage and performance; management of signatures (‘magic’ file) awkward
     - DROID 6 output handling clumsy in automated workflows (separate DROID invocation needed for exporting profile information!)
  24. Response to this work so far
     - FIDO: version 0.9.6 released in October; fixes most reported issues
     - FITS: version 0.6 released in October; various enhancements based on outcome of evaluation
     - DROID, JHOVE2: both provided feedback and will consider test results for upcoming releases
  25. Possible next steps
     - Improve evaluation of accuracy
     - Keep up with tool updates; keep this work up-to-date
     - Publish all used scripts and detailed description of analysis methods so others can contribute more easily
     - Use publicly available test corpus (e.g. Govdocs1)
  26. Link to full report on OPF blog: www.openplanetsfoundation.org/blogs/2011-09-21-evaluation-identification-tools-first-results-scape
     More about SCAPE: http://www.scape-project.eu #SCAPEProject
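
Slides 12 and 13 contrast two ways of driving a tool: starting it once per file, or starting it once for the whole batch. The sketch below shows how such a comparison could be timed. It is not the script used in the evaluation; the function names are made up for illustration, and the "tool" is a stand-in (a fresh Python interpreter that ignores its arguments) so the example runs anywhere.

```python
import subprocess
import sys
import time


def run_one_file_per_invocation(command, files):
    """Start the tool once for every file; return (seconds, invocations)."""
    start = time.perf_counter()
    for path in files:
        subprocess.run(command + [path], check=True, capture_output=True)
    return time.perf_counter() - start, len(files)


def run_many_files_per_invocation(command, files):
    """Start the tool once and pass all files; return (seconds, invocations)."""
    start = time.perf_counter()
    subprocess.run(command + list(files), check=True, capture_output=True)
    return time.perf_counter() - start, 1


if __name__ == "__main__":
    # ASSUMPTION: stand-in tool, a Python interpreter that does nothing.
    # Swap in a real identification command to measure an actual tool.
    tool = [sys.executable, "-c", "pass"]
    files = ["a.pdf", "b.xml", "c.tif"]  # hypothetical names, never opened
    print(run_one_file_per_invocation(tool, files))
    print(run_many_files_per_invocation(tool, files))
```

For very large batches the many-files variant can hit the operating system's limit on command-line length, so a real harness would probably pass files in chunks or point the tool at a directory.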
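
Slide 19 attributes the slowness to initialisation: "several seconds" per start-up against "milliseconds" of actual processing per file. A back-of-the-envelope model makes the consequence concrete. The file count comes from slide 11; the 3 s and 5 ms figures are assumptions chosen only to match those orders of magnitude and are not measurements from the evaluation.

```python
def one_file_per_invocation_seconds(n_files, init_s, per_file_s):
    """Every file pays the start-up cost plus its own processing time."""
    return n_files * (init_s + per_file_s)


def many_files_per_invocation_seconds(n_files, init_s, per_file_s):
    """The start-up cost is paid once; only processing scales with n_files."""
    return init_s + n_files * per_file_s


if __name__ == "__main__":
    N_FILES = 11_892     # size of the KB Scientific Journals set (slide 11)
    INIT_S = 3.0         # ASSUMPTION: stands in for "several seconds"
    PER_FILE_S = 0.005   # ASSUMPTION: stands in for "milliseconds"

    one = one_file_per_invocation_seconds(N_FILES, INIT_S, PER_FILE_S)
    many = many_files_per_invocation_seconds(N_FILES, INIT_S, PER_FILE_S)
    print(f"one file per invocation:   {one / 3600:.1f} hours")
    print(f"many files per invocation: {many:.0f} seconds")
```

Under these assumed figures the first mode takes roughly ten hours and the second about a minute, which is the shape of the difference the slides describe, whatever the exact numbers for a given tool.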
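
Slide 21 notes that DROID 6 and FIDO recurse into ZIP files while File does not, so the tools are not doing the same amount of work. The sketch below illustrates the difference by counting how many objects a tool would have to identify with and without container recursion. The function `count_objects` is hypothetical and uses only Python's standard `zipfile` module; it does not reproduce how any of the evaluated tools handle containers.

```python
import io
import zipfile


def count_objects(data, recurse):
    """Count objects to identify in a byte string.

    Without recursion a ZIP file is a single object. With recursion the
    ZIP itself and each member (including nested ZIPs) are counted.
    """
    if recurse and zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            members = [n for n in archive.namelist() if not n.endswith("/")]
            return 1 + sum(
                count_objects(archive.read(name), recurse) for name in members
            )
    return 1


def build_sample_zip():
    """Build a small in-memory ZIP with two members (illustrative data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("a.txt", b"hello")
        archive.writestr("b.pdf", b"%PDF-1.4")
    return buffer.getvalue()


if __name__ == "__main__":
    sample = build_sample_zip()
    print(count_objects(sample, recurse=False))
    print(count_objects(sample, recurse=True))
```

A tool that recurses does more work per input file, so raw per-file timings for recursing and non-recursing tools are not directly comparable on collections that contain archives.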
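
Slide 23 observes that signature-based identification works poorly for text-based formats, XML included. A toy matcher shows why: binary formats tend to begin with fixed byte sequences, whereas an XML document may omit its declaration or be preceded by a byte order mark. The signature table below is a tiny illustrative assumption and is not taken from PRONOM or from the 'magic' file.

```python
# Toy signature table: (leading bytes, format name). Illustrative only.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"PK\x03\x04", "ZIP"),
    (b"<?xml", "XML"),
]


def identify(data):
    """Return the first format whose signature matches the start of data."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return None


if __name__ == "__main__":
    samples = {
        "pdf": b"%PDF-1.4\n",
        "xml with declaration": b'<?xml version="1.0"?><doc/>',
        "xml without declaration": b"<doc/>",
        "xml with UTF-8 BOM": b'\xef\xbb\xbf<?xml version="1.0"?><doc/>',
    }
    for label, data in samples.items():
        print(label, "->", identify(data))
```

The last two samples are well-formed XML yet go unidentified by this matcher, because the XML declaration is optional and a byte order mark shifts it away from offset zero. Real signature sets are more flexible than a fixed prefix test, but the underlying difficulty with free-form text is the same.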
