‘Persistently’ Identifying Formats

    PRONOM, DROID and the NDHA


                     Jay Gattuso
            Digital Preservation Analyst
          National Digital Heritage Archive
          National Library of New Zealand
Summary

 How Rosetta uses DROID
 How DROID has changed
Research NDHA completed
         Results
   Recommendations
DROID & PRONOM
• PRONOM is the most
  widely used file format
  registry in the sector
• DROID is a tool that
  ‘identifies’ file types (based
  on PRONOM records)
• Both are from TNA (UK)
• DROID Signature v59
    – 551 signature sets
    – 864 file type records

[Image: EP/1958/2520-F. Registry, Hunter Building, Victoria University of Wellington. Photograph taken for the Evening Post newspaper, 31 Jul 1958. Alexander Turnbull Library]



www.nationalarchives.gov.uk/PRONOM/Default.aspx
Rosetta – A Brief History
• NLNZ Digital Preservation
  Repository
• 4 years since inception
• 18 months out of project
• 8 significant
  upgrades/software
  revisions
• ~6 Million digital objects to
  date
• Backbone of the ANZ GDAP

[Image: 1/1-000008-G. Smiley's stables and horse repository, Whanganui. Harding, William James, 1826-1899: Negatives of Wanganui district. Alexander Turnbull Library]
Write Once, Read Many
Inside Rosetta, format
identification is a ‘WORM’ process.

As a part of the ingest
routine, format identification is
automatically undertaken, written
to the file records, and the system
database, and used thereafter as
a consistent ‘label’.

We rely on the persistence of the
label to accurately plan activities and
‘measure’ the content or shape of the
repository.

[Image: E-272-f-001. Abbot, John 1751-1840: Original drawings of insects by J Abott. [1816?] Alexander Turnbull Library]
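The write-once, read-many behaviour described on this slide can be sketched in a few lines of Python. This is an illustration only, not Rosetta's implementation: the `FormatLabelStore` class, its method names and the file identifier are hypothetical, and the PUID is passed in directly rather than obtained from DROID.

```python
class FormatLabelStore:
    """Hypothetical write-once store for format labels (PUIDs)."""

    def __init__(self):
        self._labels = {}

    def record(self, file_id, puid):
        # Write once: a second write for the same file is refused.
        if file_id in self._labels:
            raise ValueError(f"label for {file_id} already written")
        self._labels[file_id] = puid

    def read(self, file_id):
        # Read many: the stored label is returned unchanged on every read.
        return self._labels[file_id]


store = FormatLabelStore()
store.record("FL0001", "fmt/14")  # label assigned once, at ingest
print(store.read("FL0001"))
```

Refusing a second write is one possible design choice; it makes any later change of label an explicit, auditable event rather than a silent overwrite.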
Behaviours and functions based on
    DROID format assertions




Rosetta uses DROID to
automatically establish
format type.
Rosetta Overview




 Validation Stack
  Automated Format
Identification via DROID
Shape Sorting...


Where:

• The area inside the box
  is Rosetta
• Each block is a DO
• Each shape is a format
• The ‘Sorter’ is DROID
Shape Sorting...

Process:

• A record is kept of the
  ‘shape’ the DO entered the
  box via
• The record is used by the
  system to trigger activities
• The DO can be removed from
  the box using the same
  shaped hole it used on entry
Shape Sorting...

Expectations:

• The ‘Sorter’ never changes
• The blocks never change
• A DO placed in the box
  yesterday will be the same
  shape tomorrow
• A DO placed in the box
  yesterday will be extractable
  via the shape tomorrow
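The expectation that a DO keeps the same 'shape' can be checked by comparing the label recorded at ingest with a fresh identification. The sketch below assumes both are available as plain dictionaries; the function name, the DO identifiers and the sample values are hypothetical and are not results from the NDHA study.

```python
def changed_labels(recorded, reidentified):
    """Return {do_id: (recorded_puid, new_puid)} for every DO whose label differs."""
    return {
        do_id: (old, reidentified.get(do_id))
        for do_id, old in recorded.items()
        if reidentified.get(do_id) != old
    }


# Invented example: DO-2 was labelled as ZIP at ingest and is
# identified differently by a later sorter.
recorded = {"DO-1": "fmt/12", "DO-2": "x-fmt/263"}
reidentified = {"DO-1": "fmt/12", "DO-2": "fmt/189"}
print(changed_labels(recorded, reidentified))
```

An empty result would mean the sorter and the blocks behaved as expected for that set of objects.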
Shape Sorting...

The reality for NDHA:

• DROID has undergone 2
  major revisions
• Container signatures have
  been included
• Since Rosetta v1 release:
   – 406 new formats,
   – 600 changes to signatures
   – (This is generally a good thing!)
Identifying and Quantifying Change

• Rosetta has used DROID versions
  3 and 5, currently testing with 6
• Rosetta has used DROID
  signature versions v13, v37, v45
  and v49, testing with v52
• Proposal to use a new DROID
  method in Rosetta
• How has/will this affect the way
  we characterise Digital Objects at
  the NDHA?

[Image: EP/1958/0585-F. Signature of Queen Elizabeth II in a visitors book. Negatives of the Evening Post newspaper. Feb 1958. Alexander Turnbull Library]
Identifying and Quantifying Change

• Source set:
   –   26,000 digital objects,
   –   ~600 GB of content,
   –   spanning 61 format types
   –   all from the live system
• DROID v3, DROID v5, DROID v6
  and DROID v6 ‘FAST’ tested
• Signatures v13, v37, v45, v49
  and v50 tested
• All files tested with and
  without file extensions

[Image: EP/1990/0432/29-F. New school patrol system being tested, Wellington. Photograph taken by John Nicholson, ca 2 Feb 1990. Alexander Turnbull Library]
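The size of the test matrix on this slide can be worked out with a short Python sketch. It assumes every DROID version was paired with every signature version, which the slides do not state (older DROID releases may not accept newer signature files), so the figure is an upper bound rather than the actual number of runs.

```python
from itertools import product

droid_versions = ["v3", "v5", "v6", "v6 FAST"]
signature_versions = ["v13", "v37", "v45", "v49", "v50"]
extension_states = ["with extension", "without extension"]

# Assumption: a full cross product of the three test dimensions.
runs = list(product(droid_versions, signature_versions, extension_states))
objects = 26_000

print(len(runs))            # 40
print(len(runs) * objects)  # 1040000
```

Under that assumption the upper bound of 1,040,000 assertions is in line with the roughly 1 million assertions reported in the talk.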
Identifying and Quantifying Change

• 1 million DROID ‘assertions’ captured
• Python and MySQL used to
  sort, clean, filter, draw graphics and
  otherwise interpret results
• Paper completed and will be available
  on the OPF website

www.openplanetsfoundation.org

[Image: DCDL-0004533. Eric Idle. 5 December, 2007. Webb, Murray, 1947-: Digital caricatures published from 29 July 2005 onwards. Alexander Turnbull Library]
Summary of Results
Of the 61 tested file types :

  75% performed identically
  for all tested versions of
  DROID and signature
  versions



                                   fmt/49
                                  (RTF 1.4)
Summary of Results
Of the 61 tested file types :

  40% consistently offered
  a single PUID across the
  range of DROID tests

  By extension: gif, avi, png,
  jpg, html, xml, bmp, wp, and
  some subsets of doc, ppt and
  exe
                                     fmt/12
                                    (PNG 1.1)
Summary of Results
Of the 61 tested file types :

  In 26% of the file types
  multiple PUIDs are
  equally asserted by
  DROID at various times.

  By extension:
  docx, xlsx, pptx, some
  pdf, doc, xls, ppt, txt, log, aiff,
  and arc

                                         fmt/7
                                      (TIF format)
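Grouping the captured assertions by file type is one way to find the types that attract more than one PUID. The Python sketch below is a minimal illustration of that grouping, not the scripts used in the study; the function names are hypothetical and the sample assertions are invented.

```python
from collections import defaultdict


def puids_by_extension(assertions):
    """Group asserted PUIDs by extension.

    `assertions` is an iterable of (extension, puid) pairs, one per DROID run.
    """
    seen = defaultdict(set)
    for extension, puid in assertions:
        seen[extension].add(puid)
    return seen


def multi_puid_extensions(assertions):
    """Extensions for which more than one distinct PUID was asserted."""
    grouped = puids_by_extension(assertions)
    return sorted(ext for ext, puids in grouped.items() if len(puids) > 1)


# Invented sample data, not results from the study.
sample = [
    ("png", "fmt/12"), ("png", "fmt/12"),
    ("docx", "x-fmt/263"), ("docx", "fmt/189"),
]
print(multi_puid_extensions(sample))  # ['docx']
```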
Summary of Results
Of the 61 tested file types :
  In 16% of the file types
  DROID version 6 in ‘FAST’
  mode performs differently from
  DROID version 6 in
  standard mode
  By extension:
  epubs, mp4, flac, wav, zip and
  some subsets of pdf, xls, tif
  and exe

                                         fmt/6
                                   (Waveform Audio)
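A percentage like the one on this slide can be derived by comparing, per file type, the set of PUIDs asserted in each mode. The following Python sketch shows the calculation under the assumption that results are held as dictionaries of sets; the function name is hypothetical and the sample values are invented, not the study's data.

```python
def share_differing(standard, fast):
    """Return (fraction, names) of file types whose asserted PUID sets differ."""
    types = set(standard) | set(fast)
    if not types:
        return 0.0, []
    differing = sorted(t for t in types if standard.get(t) != fast.get(t))
    return len(differing) / len(types), differing


# Invented example: 'wav' gets no PUID in the hypothetical FAST run.
standard = {"wav": {"fmt/6"}, "png": {"fmt/12"}, "zip": {"x-fmt/263"}, "pdf": {"fmt/14"}}
fast = {"wav": set(), "png": {"fmt/12"}, "zip": {"x-fmt/263"}, "pdf": {"fmt/14"}}

fraction, names = share_differing(standard, fast)
print(f"{fraction:.0%} differ: {names}")
```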
Recommendation 1

There is a clear need
for a community
owned dataset that
spans the PRONOM
catalogue to support
testing

(This should be
community created)

                                 ExL-fmt/62 - fmt/189
                                 (MS Open Office XML 2007)
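Whether a test dataset 'spans' the catalogue can be measured by listing the PUIDs that have no sample file. The Python sketch below assumes the catalogue is a list of PUIDs and the corpus is a mapping from sample file name to expected PUID; both structures, the function name and the file names are hypothetical.

```python
def uncovered_puids(catalogue_puids, corpus):
    """PUIDs in the catalogue that have no sample file in the corpus.

    `corpus` maps a sample file name to the PUID it is expected to match.
    """
    return sorted(set(catalogue_puids) - set(corpus.values()))


# A tiny invented catalogue and corpus for illustration.
catalogue = ["fmt/6", "fmt/7", "fmt/12", "fmt/14", "x-fmt/263"]
corpus = {"sample.png": "fmt/12", "sample.pdf": "fmt/14"}
print(uncovered_puids(catalogue, corpus))
```

The returned list is the set of formats for which a community-created sample would still be needed.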
Recommendation 2

It is strongly
recommended that
more research is
undertaken looking at
the persistence of
PUIDs to give a more
complete history of
file type assertions by
PRONOM/DROID
                                   fmt/14
                                  (PDF 1.0)
Recommendation 3
Given the variances
observed, especially with
DROID v6 ‘FAST’ mode, it
is recommended that all
signatures are robustly
tested prior to
release, and efforts are
made to maintain
consistency with legacy
signatures, and limit
impact on users

                                 x-fmt/263
                                 (ZIP format)
Recap

 How Rosetta uses DROID
 How DROID has changed
Research NDHA completed
         Results
   Recommendations
Thank you

            jay.gattuso@dia.govt.nz

     Rosetta demo – Wednesday 28th March
    9am to 1pm @ NLNZ - 77 Thorndon Quay
Paper available through the Open Planets Website
        www.openplanetsfoundation.org

Jay Gattuso Persistently Identifying Formats

  • 1. ‘Persistently’ Identifying Formats PRONOM, DROID and the NDHA Jay Gattuso Digital Preservation Analyst National Digital Heritage Archive National Library of New Zealand
  • 2. Summary How Rosetta uses DROID How DROID has changed Research NDHA completed Results Recommendations
  • 3. DROID & PRONOM • PRONOM is the most widely used file format registry in the sector • DROID is a tool that ‘identifies’ file types (based on PRONOM records) • Both are from TNA (UK) • DROID Signature v59 – 551 signature sets – 864 file type records www.nationalarchives.gov.uk/PRONOM/Default.aspx [Image: EP/1958/2520-F, Registry, Hunter Building, Victoria University of Wellington. Photograph taken for the Evening Post newspaper, 31 Jul 1958. Alexander Turnbull Library]
  • 4. Rosetta – A Brief History • NLNZ Digital Preservation Repository • 4 years since inception • 18 months out of project • 8 significant upgrades/software revisions • ~6 Million digital objects to date • Backbone of the ANZ GDAP [Image: 1/1-000008-G, Smiley's stables and horse repository, Whanganui. Harding, William James, 1826-1899 :Negatives of Wanganui district. Alexander Turnbull Library]
  • 5. Write Once, Read Many Inside Rosetta, format identification is a ‘WORM’ process. As a part of the ingest routine, format identification is automatically undertaken, written to the file records, and the system database, and used thereafter as a consistent ‘label’. We rely on the persistence of the label to accurately plan activities and ‘measure’ the content or shape of the repository. [Image: E-272-f-001, Abbot, John 1751-1840 :Original drawings of insects by J Abott. [1816?]. Alexander Turnbull Library]
  • 6. Behaviours and functions based on DROID format assertions Rosetta uses DROID to automatically establish format type.
  • 7. Rosetta Overview Validation Stack Automated Format Identification via DROID
  • 8. Shape Sorting... Where: • The area inside the box is Rosetta • Each block is a DO • Each shape is a format • The ‘Sorter’ is DROID
  • 9. Shape Sorting... Process: • A record is kept of the ‘shape’ the DO entered the box via • The record is used by the system to trigger activities • The DO can be removed from the box using the same shaped hole it used on entry
  • 10. Shape Sorting... Expectations: • The ‘Sorter’ never changes • The blocks never change • A DO placed in the box yesterday will be the same shape tomorrow • A DO placed in the box yesterday will be extractable via the shape tomorrow
  • 11. Shape Sorting... The reality for NDHA: • DROID has undergone 2 major revisions • Container signatures have been included • Since Rosetta v1 release: – 406 new formats, – 600 changes to signatures – (This is generally a good thing!)
  • 12. Identifying and Quantifying Change • Rosetta has used DROID versions 3 and 5, currently testing with 6 • Rosetta has used DROID signature versions v13, v37, v45 and v49, testing with v52 • Proposal to use a new DROID method in Rosetta • How has/will this affect the way we characterise Digital Objects at the NDHA? [Image: EP/1958/0585-F, Signature of Queen Elizabeth II in a visitors book. Negatives of the Evening Post newspaper. Feb 1958. Alexander Turnbull Library]
  • 13. Identifying and Quantifying Change • Source set: – 26,000 digital objects, – ~600 Gb of content, – spanning 61 format types – all from the live system • DROID v3, DROID v5, DROID v6 and DROID v6 ‘FAST’ tested • Signatures v13, v37, v45, v49 and v50 tested • All files tested with and without file extensions [Image: EP/1990/0432/29-F, New school patrol system being tested, Wellington. Photograph taken by John Nicholson ca 2 Feb 1990. Alexander Turnbull Library]
  • 14. Identifying and Quantifying Change • 1 million DROID ‘assertions’ captured • Python and MySQL used to sort, clean, filter, draw graphics and otherwise interpret results • Paper completed and will be available on the OPF website www.openplanetsfoundation.org [Image: DCDL-0004533, Eric Idle. 5 December, 2007. Webb, Murray, 1947- :Digital caricatures published from 29 July 2005 onwards. Alexander Turnbull Library]
  • 15. Summary of Results Of the 61 tested file types : 75% performed identically for all tested versions of DROID and signature versions fmt/49 (RTF 1.4)
  • 16. Summary of Results Of the 61 tested file types : 40% consistently offered a single PUID across the range of DROID tests By extension: gif, avi, png, jpg, html, xml, bmp, wp, and some subsets of doc, ppt and exe fmt/12 (PNG 1.1)
  • 17. Summary of Results Of the 61 tested file types : In 26% of the file types multiple PUIDs are equally asserted by DROID at various times. By extension: docx, xlsx, pptx, some pdf, doc, xls, ppt, txt, log, aiff, and arc fmt/7 (TIF format)
  • 18. Summary of Results Of the 61 tested file types : In 16% of the file types DROID version 6 in ‘FAST’ mode performs differently from DROID version 6 in standard mode By extension: epubs, mp4, flac, wav, zip and some subsets of pdf, xls, tif and exe fmt/6 (Waveform Audio)
  • 19. Recommendation 1 There is a clear need for a community owned dataset that spans the PRONOM catalogue to support testing (This should be community created) ExL-fmt/62 - fmt/189 (MS Open Office XML 2007)
  • 20. Recommendation 2 It is strongly recommended that more research is undertaken looking at the persistence of PUIDs to give a more complete history of file type assertions by PRONOM/DROID fmt/14 (PDF 1.0)
  • 21. Recommendation 3 Given the variances observed, especially with DROID v6 ‘FAST’ mode, it is recommended that all signatures are robustly tested prior to release, and efforts are made to maintain consistency with legacy signatures, and limit impact on users x-fmt/263 (ZIP format)
  • 22. Recap How Rosetta uses DROID How DROID has changed Research NDHA completed Results Recommendations
  • 23. Thank you jay.gattuso@dia.govt.nz Rosetta demo – Wednesday 28th March 9am to 1pm @ NLNZ - 77 Thorndon Quay Paper available through the Open Planets Website www.openplanetsfoundation.org
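
The comparison described in slides 13 to 18 can be illustrated with a minimal Python sketch. It is not the NDHA's actual analysis code, which is not reproduced in these slides. The record layout `(file_id, extension, run, puid)`, the run labels, the three category names and every row of sample data are assumptions invented for illustration; none of them are NDHA results or real DROID output.

```python
from collections import defaultdict

# Invented sample data, NOT NDHA results. Each row is one hypothetical
# DROID assertion: (file_id, extension, run label, asserted PUID).
# Two rows for the same file in the same run stand for two PUIDs
# being asserted equally.
ASSERTIONS = [
    ("a.png", "png", "v5/sig45", "fmt/12"),
    ("a.png", "png", "v6/sig49", "fmt/12"),
    ("a.png", "png", "v6-fast/sig49", "fmt/12"),
    ("b.docx", "docx", "v5/sig45", "x-fmt/263"),
    ("b.docx", "docx", "v6/sig49", "fmt/189"),
    ("b.docx", "docx", "v6-fast/sig49", "fmt/189"),
    ("c.doc", "doc", "v5/sig45", "fmt/39"),
    ("c.doc", "doc", "v5/sig45", "fmt/40"),
    ("c.doc", "doc", "v6/sig49", "fmt/40"),
    ("d.wav", "wav", "v6/sig49", "fmt/6"),
    ("d.wav", "wav", "v6-fast/sig49", "fmt/141"),
]


def puids_by_file_and_run(assertions):
    """Group assertions as file_id -> run label -> set of PUIDs."""
    table = defaultdict(lambda: defaultdict(set))
    for file_id, _ext, run, puid in assertions:
        table[file_id][run].add(puid)
    return table


def classify_file(runs):
    """Label one file's history (category names are hypothetical).

    'multiple': some run asserted more than one PUID for the file.
    'changed':  every run gave one PUID, but the PUID differs between runs.
    'single':   every run gave the same single PUID.
    """
    puid_sets = [frozenset(puids) for puids in runs.values()]
    if any(len(s) > 1 for s in puid_sets):
        return "multiple"
    if len(set(puid_sets)) > 1:
        return "changed"
    return "single"


def fast_differs(runs, standard="v6/sig49", fast="v6-fast/sig49"):
    """True if the 'FAST' run's PUIDs differ from the standard run's."""
    return runs.get(standard) != runs.get(fast)


def summarise(assertions):
    """Map each extension to the sorted categories seen among its files."""
    table = puids_by_file_and_run(assertions)
    ext_of = {file_id: ext for file_id, ext, _run, _puid in assertions}
    summary = defaultdict(set)
    for file_id, runs in table.items():
        summary[ext_of[file_id]].add(classify_file(runs))
    return {ext: sorted(labels) for ext, labels in summary.items()}


print(summarise(ASSERTIONS))
```

Keeping one row per assertion, rather than one row per file, lets the same table answer both questions raised in the results slides: whether a label persisted across DROID and signature versions, and whether ‘FAST’ mode agrees with standard mode for the same file.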