Jay Gattuso Persistently Identifying Formats

‘Persistently’ Identifying Formats

PRONOM, DROID and the NDHA

Jay Gattuso
Digital Preservation Analyst
National Digital Heritage Archive
National Library of New Zealand

Summary

How Rosetta uses DROID
How DROID has changed
Research NDHA completed
Results
Recommendations

DROID & PRONOM
• PRONOM is the most
widely used file format
registry in the sector
• DROID is a tool that
‘identifies’ file types (based
on PRONOM records)
• Both are from TNA (UK)
• DROID Signature v59
– 551 signature sets
– 864 file type records

Image: EP/1958/2520-F. Registry, Hunter Building, Victoria University of Wellington. Photograph taken for the Evening Post newspaper, 31 Jul 1958. Alexander Turnbull Library

www.nationalarchives.gov.uk/PRONOM/Default.aspx

Rosetta – A Brief History
• NLNZ Digital Preservation
Repository
• 4 years since inception
• 18 months out of project
• 8 significant upgrades/software revisions
• ~6 million digital objects to date
• Backbone of the ANZ GDAP

Image: 1/1-000008-G. Smiley's stables and horse repository, Whanganui. Harding, William James, 1826-1899: Negatives of Wanganui district. Alexander Turnbull Library

Write Once, Read Many
Inside Rosetta, format
identification is a ‘WORM’ process.

As part of the ingest routine, format identification is
undertaken automatically, written to the file records and
the system database, and used thereafter as
a consistent ‘label’.

We rely on the persistence of the label to accurately plan activities and
‘measure’ the content or shape of the repository.

Image: E-272-f-001. Abbot, John 1751-1840: Original drawings of insects by J Abott. [1816?] Alexander Turnbull Library
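The write-once, read-many behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Rosetta's implementation: the class names, field names and the in-memory dictionary standing in for the file records and system database are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FormatLabel:
    """One identification result fixed at ingest (all field names hypothetical)."""
    object_id: str
    puid: str                # PRONOM identifier, e.g. "fmt/12"
    droid_version: str
    signature_version: str
    recorded_at: datetime


class WriteOnceLabelStore:
    """In-memory stand-in for the file records and system database."""

    def __init__(self):
        self._labels = {}

    def write(self, label):
        # Write once: a second write for the same object is refused.
        if label.object_id in self._labels:
            raise ValueError(f"label already recorded for {label.object_id}")
        self._labels[label.object_id] = label

    def read(self, object_id):
        # Read many: every later activity sees the label recorded at ingest.
        return self._labels[object_id]


store = WriteOnceLabelStore()
store.write(FormatLabel("DO-0001", "fmt/12", "6", "v49", datetime.now(timezone.utc)))
print(store.read("DO-0001").puid)
```

Keeping the DROID and signature versions next to the PUID is a design choice of this sketch: it records which 'sorter' produced the label, so the label can still be interpreted after the tool changes.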

Behaviours and functions based on
DROID format assertions

Rosetta uses DROID to
automatically establish
format type.

Rosetta Overview

Validation Stack
Automated Format
Identification via DROID

Shape Sorting...

Where:

• The area inside the box
is Rosetta
• Each block is a DO
• Each shape is a format
• The ‘Sorter’ is DROID

Shape Sorting...

Process:

• A record is kept of the
‘shape’ of the hole through which
the DO entered the box
• The record is used by the
system to trigger activities
• The DO can be removed from
the box using the same
shaped hole it used on entry

Shape Sorting...

Expectations:

• The ‘Sorter’ never changes
• The blocks never change
• A DO placed in the box
yesterday will be the same
shape tomorrow
• A DO placed in the box
yesterday will be extractable
via the shape tomorrow

Shape Sorting...

The reality for NDHA:

• DROID has undergone 2
major revisions
• Container signatures have
been included
• Since Rosetta v1 release:
– 406 new formats,
– 600 changes to signatures
– (This is generally a good thing!)

Identifying and Quantifying Change

• Rosetta has used DROID versions
3 and 5, currently testing with 6
• Rosetta has used DROID
signature versions v13, v37, v45
and v49, testing with v52
• Proposal to use a new DROID
method in Rosetta
• How has/will this affect the way
we characterise Digital Objects at
the NDHA?

Image: EP/1958/0585-F. Signature of Queen Elizabeth II in a visitors book. Negatives of the Evening Post newspaper. Feb 1958


• Source set:
– 26,000 digital objects,
– ~600 GB of content,
– spanning 61 format types
– all from the live system
• DROID v3, DROID v5, DROID v6
and DROID v6 ‘FAST’ tested
• Signatures v13, v37, v45, v49
and v50 tested
• All files tested with and
without file extensions

Image: EP/1990/0432/29-F. New school patrol system being tested, Wellington. Photograph taken by John Nicholson, ca 2 Feb 1990. Alexander Turnbull Library


• 1 million DROID ‘assertions’ captured
• Python and MySQL used to
sort, clean, filter, draw graphics and
otherwise interpret results
• Paper completed and will be available
on the OPF website

www.openplanetsfoundation.org

Image: DCDL-0004533. Eric Idle. 5 December, 2007. Webb, Murray, 1947-: Digital caricatures published from 29 July 2005 onwards
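As a rough illustration of the kind of sorting described above, the Python sketch below groups captured assertions by file and checks whether every DROID/signature configuration asserted the same set of PUIDs. It is not the script used in the research. The row layout and function names are assumptions, and the sample rows are invented for the example rather than taken from the results.

```python
from collections import defaultdict

# Invented sample rows: (file_id, extension, configuration, asserted PUIDs).
# A configuration is one DROID version paired with one signature version.
assertions = [
    ("a.png", "png", "DROID v3 / sig v13", {"fmt/12"}),
    ("a.png", "png", "DROID v6 / sig v49", {"fmt/12"}),
    ("b.doc", "doc", "DROID v3 / sig v13", {"fmt/39", "fmt/40"}),
    ("b.doc", "doc", "DROID v6 / sig v49", {"fmt/40"}),
]


def persistence_by_file(rows):
    """Map each file to True if every configuration asserted the same PUID set."""
    seen = defaultdict(set)
    for file_id, _ext, _config, puids in rows:
        seen[file_id].add(frozenset(puids))
    return {file_id: len(results) == 1 for file_id, results in seen.items()}


def unstable_extensions(rows):
    """List the extensions with at least one file whose assertion changed."""
    stable = persistence_by_file(rows)
    return sorted({ext for file_id, ext, _config, _puids in rows if not stable[file_id]})


print(persistence_by_file(assertions))
print(unstable_extensions(assertions))
```

Treating each assertion as a set of PUIDs, rather than a single value, lets the same comparison cover files for which DROID offers more than one candidate identification.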

Summary of Results
Of the 61 tested file types:

75% performed identically
across all tested DROID versions
and signature versions

fmt/49
(RTF 1.4)

Summary of Results

40% consistently offered
a single PUID across the
range of DROID tests

By extension: gif, avi, png,
jpg, html, xml, bmp, wp, and
some subsets of doc, ppt and
exe
fmt/12
(PNG 1.1)

Summary of Results

In 26% of the file types
multiple PUIDs are
equally asserted by
DROID at various times.

By extension:
docx, xlsx, pptx, some
pdf, doc, xls, ppt, txt, log, aiff,
and arc

fmt/7
(TIF format)

Summary of Results
In 16% of the file types
DROID version 6 in ‘FAST’
mode performs differently
from DROID version 6 in
standard mode
By extension:
epubs, mp4, flac, wav, zip and
some subsets of pdf, xls, tif
and exe

fmt/6
(Waveform Audio)

Recommendation 1

There is a clear need
for a community
owned dataset that
spans the PRONOM
catalogue to support
testing

(This should be
community created)

ExL-fmt/62 - fmt/189
(MS Open Office XML 2007)

Recommendation 2

It is strongly
recommended that
more research is
undertaken into
the persistence of
PUIDs, to give a more
complete history of
file type assertions by
PRONOM/DROID
fmt/14
(PDF 1.0)

Recommendation 3
Given the variances
observed, especially with
DROID v6 ‘FAST’ mode, it
is recommended that all
signatures are robustly
tested prior to
release, and efforts are
made to maintain
consistency with legacy
signatures, and limit
impact on users

x-fmt/263
(ZIP format)

Recap

How Rosetta uses DROID
How DROID has changed
Research NDHA completed
Results
Recommendations

Thank you

jay.gattuso@dia.govt.nz

Rosetta demo – Wednesday 28th March
9am to 1pm @ NLNZ - 77 Thorndon Quay
Paper available through the Open Planets Website
www.openplanetsfoundation.org
