‘Persistently’ Identifying Formats
PRONOM, DROID and the NDHA
Jay Gattuso

Published in: Technology, Art & Photos

  1. ‘Persistently’ Identifying Formats: PRONOM, DROID and the NDHA
     Jay Gattuso, Digital Preservation Analyst, National Digital Heritage Archive, National Library of New Zealand
  2. Summary
     • How Rosetta uses DROID
     • How DROID has changed
     • Research NDHA completed
     • Results
     • Recommendations
  3. DROID & PRONOM
     • PRONOM is the most widely used file format registry in the sector
     • DROID is a tool that ‘identifies’ file types (based on PRONOM records)
     • Both are from TNA (UK)
     • DROID Signature v59
       – 551 signature sets
       – 864 file type records
     www.nationalarchives.gov.uk/PRONOM/Default.aspx
     [Image: EP/1958/2520-F. Registry, Hunter Building, Victoria University of Wellington. Photograph taken for the Evening Post newspaper, 31 Jul 1958. Alexander Turnbull Library]
  4. Rosetta – A Brief History
     • NLNZ Digital Preservation Repository
     • 4 years since inception
     • 18 months out of project
     • 8 significant upgrades/software revisions
     • ~6 million digital objects to date
     • Backbone of the ANZ GDAP
     [Image: 1/1-000008-G. Smileys stables and horse repository, Whanganui. Harding, William James, 1826-1899: Negatives of Wanganui district. Alexander Turnbull Library]
  5. Write Once, Read Many
     Inside Rosetta, format identification is a ‘WORM’ process.
     As a part of the ingest routine, format identification is automatically undertaken, written to the file records and the system database, and used thereafter as a consistent ‘label’.
     We rely on the persistence of the label to accurately plan activities and ‘measure’ the content or shape of the repository.
     [Image: E-272-f-001. Abbot, John 1751-1840: Original drawings of insects by J Abott. [1816?]. Alexander Turnbull Library]
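The write-once behaviour described on this slide can be pictured with a minimal Python sketch. The class and method names below are hypothetical and are not part of Rosetta; the sketch only illustrates the idea that a format label is recorded once at ingest and is read, never rewritten, afterwards.

```python
class FormatLabelStore:
    """Toy write-once store mapping a digital object ID to its PUID label.

    Hypothetical illustration only; Rosetta's real data model is not shown
    in the slides.
    """

    def __init__(self):
        self._labels = {}

    def record(self, object_id, puid):
        # Write once: refuse to overwrite a label assigned at ingest.
        if object_id in self._labels:
            raise ValueError(f"label already recorded for {object_id}")
        self._labels[object_id] = puid

    def label(self, object_id):
        # Read many: later activities rely on the stored label.
        return self._labels[object_id]


store = FormatLabelStore()
store.record("object-001", "fmt/12")  # PUID shown in the deck for PNG 1.1
```

The point of the sketch is the dependency it exposes: everything downstream trusts the stored label, so a change in what the identification tool would assert today is invisible unless the object is re-identified.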
  6. Behaviours and functions based on DROID format assertions
     Rosetta uses DROID to automatically establish format type.
  7. Rosetta Overview: Validation Stack
     Automated Format Identification via DROID
  8. Shape Sorting... Where:
     • The area inside the box is Rosetta
     • Each block is a DO (digital object)
     • Each shape is a format
     • The ‘Sorter’ is DROID
  9. Shape Sorting... Process:
     • A record is kept of the ‘shape’ through which the DO entered the box
     • The record is used by the system to trigger activities
     • The DO can be removed from the box using the same shaped hole it used on entry
  10. Shape Sorting... Expectations:
     • The ‘Sorter’ never changes
     • The blocks never change
     • A DO placed in the box yesterday will be the same shape tomorrow
     • A DO placed in the box yesterday will be extractable via the shape tomorrow
  11. Shape Sorting... The reality for NDHA:
     • DROID has undergone 2 major revisions
     • Container signatures have been included
     • Since Rosetta v1 release:
       – 406 new formats
       – 600 changes to signatures
       – (This is generally a good thing!)
  12. Identifying and Quantifying Change
     • Rosetta has used DROID versions 3 and 5, currently testing with 6
     • Rosetta has used DROID signature versions v13, v37, v45 and v49, testing with v52
     • Proposal to use a new DROID method in Rosetta
     • How has/will this affect the way we characterise Digital Objects at the NDHA?
     [Image: EP/1958/0585-F. Signature of Queen Elizabeth II in a visitors book. Negatives of the Evening Post newspaper. Feb 1958. Alexander Turnbull Library]
  13. Identifying and Quantifying Change
     • Source set:
       – 26,000 digital objects
       – ~600 GB of content
       – spanning 61 format types
       – all from the live system
     • DROID v3, DROID v5, DROID v6 and DROID v6 ‘FAST’ tested
     • Signatures v13, v37, v45, v49 and v50 tested
     • All files tested with and without file extensions
     [Image: EP/1990/0432/29-F. New school patrol system being tested, Wellington. Photograph taken by John Nicholson ca 2 Feb 1990. Alexander Turnbull Library]
  14. Identifying and Quantifying Change
     • 1 million DROID ‘assertions’ captured
     • Python and MySQL used to sort, clean, filter, draw graphics and otherwise interpret results
     • Paper completed and will be available on the OPF website: www.openplanetsfoundation.org
     [Image: DCDL-0004533. Eric Idle. 5 December, 2007. Webb, Murray, 1947- : Digital caricatures published from 29 July 2005 onwards. Alexander Turnbull Library]
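The slides do not show the analysis code itself. As a rough sketch of how captured assertions could be tabulated, the following uses Python with the standard-library `sqlite3` module standing in for MySQL. The table layout, column names and sample rows are assumptions made for illustration; they are not the NDHA schema or data.

```python
import sqlite3

# Hypothetical sample rows, NOT real results:
# (file_id, extension, droid_version, signature_version, puid)
SAMPLE_ASSERTIONS = [
    ("f1", "png", "3", "v13", "fmt/12"),
    ("f1", "png", "5", "v45", "fmt/12"),
    ("f1", "png", "6", "v50", "fmt/12"),
    ("f2", "pdf", "3", "v13", "fmt/14"),
    ("f2", "pdf", "5", "v45", "fmt/15"),  # made-up change of assertion
    ("f2", "pdf", "6", "v50", "fmt/15"),
]


def load_assertions(rows):
    """Load assertion rows into an in-memory table (SQLite in place of MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE assertions ("
        "file_id TEXT, extension TEXT, droid_version TEXT, "
        "signature_version TEXT, puid TEXT)"
    )
    conn.executemany("INSERT INTO assertions VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def distinct_puids_per_file(conn):
    """Count how many different PUIDs each file received across all test runs."""
    cursor = conn.execute(
        "SELECT file_id, COUNT(DISTINCT puid) FROM assertions "
        "GROUP BY file_id ORDER BY file_id"
    )
    return dict(cursor.fetchall())


conn = load_assertions(SAMPLE_ASSERTIONS)
counts = distinct_puids_per_file(conn)
```

A file with a count of 1 was labelled identically by every tested configuration; any higher count marks a file whose label depends on which DROID or signature version was used.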
  15. Summary of Results
     Of the 61 tested file types:
     75% performed identically for all tested versions of DROID and signature versions
     [Image label: fmt/49 (RTF 1.4)]
  16. Summary of Results
     Of the 61 tested file types:
     40% consistently offered a single PUID across the range of DROID tests
     By extension: gif, avi, png, jpg, html, xml, bmp, wp, and some subsets of doc, ppt and exe
     [Image label: fmt/12 (PNG 1.1)]
  17. Summary of Results
     Of the 61 tested file types:
     In 26% of the file types multiple PUIDs are equally asserted by DROID at various times.
     By extension: docx, xlsx, pptx, some pdf, doc, xls, ppt, txt, log, aiff, and arc
     [Image label: fmt/7 (TIF format)]
  18. Summary of Results
     Of the 61 tested file types:
     In 16% of the file types DROID version 6 in ‘FAST’ mode performs differently from DROID version 6 in standard mode
     By extension: epubs, mp4, flac, wav, zip and some subsets of pdf, xls, tif and exe
     [Image label: fmt/6 (Waveform Audio)]
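A figure such as the share of file types where ‘FAST’ mode differs from standard mode can be derived by comparing, file by file, the set of PUIDs each mode asserted. The sketch below is one possible way to do that in Python. The function name, the input tuple layout and the sample rows are all assumptions for illustration; the rows are invented and do not reproduce the NDHA results.

```python
from collections import defaultdict


def fast_mode_differences(assertions):
    """Return (file types where FAST and standard disagree, share in percent).

    `assertions` is an iterable of (file_id, file_type, mode, puid) tuples,
    where mode is "standard" or "fast". Hypothetical layout.
    """
    by_file = defaultdict(lambda: {"standard": set(), "fast": set()})
    type_of = {}
    for file_id, file_type, mode, puid in assertions:
        by_file[file_id][mode].add(puid)
        type_of[file_id] = file_type

    all_types = set(type_of.values())
    # A file type counts as differing if any of its files was labelled
    # differently by the two modes.
    differing = {
        type_of[file_id]
        for file_id, modes in by_file.items()
        if modes["standard"] != modes["fast"]
    }
    share = 100.0 * len(differing) / len(all_types) if all_types else 0.0
    return sorted(differing), share


# Invented sample rows for illustration only.
sample = [
    ("a", "png", "standard", "fmt/12"),
    ("a", "png", "fast", "fmt/12"),
    ("b", "zip", "standard", "x-fmt/263"),
    ("b", "zip", "fast", "unidentified"),
]
differing_types, share = fast_mode_differences(sample)
```

Comparing sets rather than single values matters here because, as an earlier results slide notes, DROID can assert more than one PUID for the same file.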
  19. Recommendation 1
     There is a clear need for a community owned dataset that spans the PRONOM catalogue to support testing
     (This should be community created)
     [Image label: ExL-fmt/62 - fmt/189 (MS Open Office XML 2007)]
  20. Recommendation 2
     It is strongly recommended that more research is undertaken looking at the persistence of PUIDs to give a more complete history of file type assertions by PRONOM/DROID
     [Image label: fmt/14 (PDF 1.0)]
  21. Recommendation 3
     Given the variances observed, especially with DROID v6 ‘FAST’ mode, it is recommended that all signatures are robustly tested prior to release, and efforts are made to maintain consistency with legacy signatures, and limit impact on users
     [Image label: x-fmt/263 (ZIP format)]
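One way to picture the pre-release testing recommended here is a simple regression check: identify a fixed test corpus with the legacy signature file and with the candidate signature file, then list every file whose PUID changed. The Python sketch below assumes the two result sets are already available as dictionaries mapping a file ID to its asserted PUID. The function name and sample values are hypothetical, and the sketch does not call DROID itself.

```python
def puid_regressions(baseline, candidate):
    """Return {file_id: (old_puid, new_puid)} for every changed assertion.

    `baseline` holds results from the legacy signature file, `candidate`
    results from the signature file under test. A file missing from
    `candidate` is reported with a new PUID of None.
    """
    changes = {}
    for file_id, old_puid in baseline.items():
        new_puid = candidate.get(file_id)
        if new_puid != old_puid:
            changes[file_id] = (old_puid, new_puid)
    return changes


# Invented results for illustration only.
legacy_results = {"f1": "fmt/12", "f2": "fmt/14", "f3": "x-fmt/263"}
candidate_results = {"f1": "fmt/12", "f2": "fmt/15"}
changed = puid_regressions(legacy_results, candidate_results)
```

Not every reported change is a defect, since a new signature may be a deliberate refinement. Listing changes explicitly lets each one be reviewed before release rather than discovered by repository users afterwards.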
  22. Recap
     • How Rosetta uses DROID
     • How DROID has changed
     • Research NDHA completed
     • Results
     • Recommendations
  23. Thank you
     jay.gattuso@dia.govt.nz
     Rosetta demo – Wednesday 28th March, 9am to 1pm @ NLNZ - 77 Thorndon Quay
     Paper available through the Open Planets website: www.openplanetsfoundation.org
