Andrew Jackson
Web Archiving Technical Lead
British Library
Unified Characterisation, Please
The Practitioners Have Spoken…
• Quality Assurance (of broken or potentially broken data):
  – Quality assurance, bit rot, and integrity.
• Appraisal and Assessment:
  – Appraisal and assessment, conformance, unknown characteristics, and unknown file formats.
• Identify/Locate Preservation-Worthy Data
• Identify Preservation Risks:
  – Obsolescence, preservation risk, and business constraints.
• Long tail of many other issues:
  – Contextual and data-capture issues through to embedded objects, and broader issues around value and cost.
• Plus: Sustainable Tools
Appraisal and Assessment
Conformance, unknown characteristics, and unknown file formats.
Identify/locate preservation-worthy data.
• Identification
  – Always used to 'route' data to software that can understand it.
  – Use the minimum information needed to identify:
    e.g. the header only, if possible. Report "Truncated PDF", not "UNKNOWN". A GIS shapefile with .shp and .shx present but .dbf missing should be reported as such.
• Validation
  – Two modes needed: "fast fail" and "log and continue" (quirks mode).
  – Stop the baseless distinction between "well formed" and "valid".
  – Validation is irrelevant to digital preservation assessment:
    e.g. a file that is effectively "PDF/A" apart from the 1.4 version marker and the XMP chunk.
  – We're on the wrong side of Postel's Law.
  – Unknown completeness and failure to future-proof:
    e.g. JHOVE tries to validate versions of PDF it cannot know about.
    e.g. tools sometimes interpret/migrate data opaquely.
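The two validation modes above can be sketched as follows (the check names are made up for illustration): "fast fail" stops at the first error, while "log and continue" records every problem it can find.

```python
def validate(checks, fast_fail=False):
    # checks is a list of (check_name, passed) pairs.
    errors = []
    for name, passed in checks:
        if not passed:
            if fast_fail:
                return [name]        # fast fail: stop immediately
            errors.append(name)      # quirks mode: log and continue
    return errors

checks = [("header", True), ("xref table", False), ("trailer", False)]
```

With `fast_fail=True` this reports only `["xref table"]`; in the default mode it reports both failing checks, which is the behaviour a preservation workflow usually wants.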
Identify Preservation Risks
Obsolescence, preservation risk, and business constraints
• Significant Properties are irrelevant here.
  – It's not really about the content, but about the context.
• Dependency Analysis:
  – What software does this need?
  – Does this file use format features that are not well supported across implementations?
  – What other resources are transcluded?
    Fonts? c.f. OfficeDDT. Remote embeds? Embedded scripts that might mask dependencies?
  – Do some operations require a password?
    e.g. JHOVE cannot spot 'harmless' PDF encryption.
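Illustrative only: a naive token scan for the dependency signals the questions above ask about. Real tools must parse the PDF object structure properly; this just sketches the categories of question (the token choices are common PDF dictionary keys, not an exhaustive or authoritative test).

```python
def pdf_dependency_signals(data: bytes) -> dict:
    # Crude substring checks against raw PDF bytes.
    return {
        "encrypted": b"/Encrypt" in data,        # may require a password
        "javascript": b"/JavaScript" in data,    # scripts can mask dependencies
        "remote_embed": b"/URI" in data,         # transcluded remote resources
        "non_embedded_font": b"/Font" in data and b"/FontFile" not in data,
    }
```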
Sustainable Tools
Our Tools
• Pure-Java Characterisation:
  – JHOVE ('clean room' implementation)
  – New Zealand Metadata Extractor (NZME)
  – Apache Tika
• Java-based aggregation of various CLI tools:
  – JHOVE2
  – FITS
• Other Characterisation:
  – XCL – a C++/XML 'clean room' implementation extended with ImageMagick
  – Many more, inc. forensics tools, BitCurator, OfficeDDT, jpylyzer…
• Identification:
  – DROID, FIDO, Apache Tika, file
• Visualisation:
  – C3PO, and many non-specialised tools.
Sustainable Tools
Up to date? Working together?
• Software Dependency Management:
  – FITS and JHOVE2 embed old DROID versions that are hard to upgrade.
  – Dead dependencies: FITS and FFIdent, NZME and Jflac.
  – Is FITS embedding JHOVE2, or is JHOVE2 embedding FITS?
  – Embed shared modules instead?
• Software Project Management and Communication:
  – JHOVE, JHOVE2? FITS?
  – JHOVE2 only compiles on Sheila's branch?
  – Roadmaps, issue management, testing, C.I., etc.
  – Cross-project coordination and bug-fixing?
• Complexity: JHOVE2 and XCL are extremely complex.
  – JHOVE2's Berkeley DB causes checksum failures in tests.
  – Tika solves the same problem using SAX.
Sustainable Tools
Shared tests?
• Separate projects arise from separate workflows.
  – Start by understanding commonality and finding gaps?
  – Share test cases and compare results?
• The OPF Format Corpus contains various valid and invalid files.
  – Built by practitioners to test real use cases.
  – e.g. JP2 features, the PDF Cabinet of Horrors.
• Do the tools give consistent and complementary results?
  – Let's find out!
  – c.f. Dave Tarrant's REF for Identification:
    http://data.openplanetsfoundation.org/ref/
    http://data.openplanetsfoundation.org/ref/pdf/pdf_1.7/
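A hypothetical harness for the "compare results" idea: run every identification tool over a shared corpus and flag the files where they disagree. The tool callables below are stand-ins, not wrappers around the real tools.

```python
def compare_tools(files, tools):
    # tools maps a tool name to a callable taking a filename and
    # returning that tool's identification result.
    disagreements = {}
    for f in files:
        results = {name: tool(f) for name, tool in tools.items()}
        if len(set(results.values())) > 1:
            disagreements[f] = results
    return disagreements

# Stand-in "tools" that disagree on how precisely to label a PDF:
tools = {
    "tool_a": lambda f: "PDF" if f.endswith(".pdf") else "UNKNOWN",
    "tool_b": lambda f: "PDF/A" if f.endswith(".pdf") else "UNKNOWN",
}
```

Running this over a corpus surfaces exactly the cases worth investigating: consistent results build confidence, while disagreements either expose a bug or a genuinely ambiguous file.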
Bit-mashing as Tool QA
• Bitwise exploration of data sensitivity.
• One way to compare tools.
• Helps understand formats.
• c.f. Jay Gattuso's recent OPF blog.
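A sketch of the bit-mashing idea: flip each bit of the input in turn, re-run a checker, and record which (byte, bit) positions change its verdict. This maps out which bytes a tool is actually sensitive to, and differences between tools' sensitivity maps are a way to compare them.

```python
def bit_mash(data: bytes, checker):
    # checker is any callable giving a verdict on a byte string.
    baseline = checker(data)
    sensitive = []
    for i in range(len(data)):
        for bit in range(8):
            mutated = bytearray(data)
            mutated[i] ^= 1 << bit    # flip a single bit
            if checker(bytes(mutated)) != baseline:
                sensitive.append((i, bit))
    return sensitive
```

For example, with a checker that only looks at a 4-byte magic number, every flip in the first four bytes changes the verdict and no later flip does, so the sensitivity map shows exactly which bytes the "tool" reads.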
Quality Assurance (of broken or potentially broken data)
Quality assurance, bit rot, and integrity
• JHOVE let a failed TIFF-to-JP2 migration through…
• Jpylyzer does better.
• Both fall far short of actual rendering.
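To illustrate why such checks fall short: the shallowest possible JP2 test is the 12-byte signature box from the JPEG 2000 (JP2) file format. Passing it says nothing about whether the codestream actually decodes, let alone renders correctly.

```python
# The 12-byte JP2 signature box: length, 'jP  ' box type, <CR><LF>0x87<LF>.
JP2_SIGNATURE = bytes([0x00, 0x00, 0x00, 0x0C, 0x6A, 0x50, 0x20, 0x20,
                       0x0D, 0x0A, 0x87, 0x0A])

def looks_like_jp2(data: bytes) -> bool:
    # Signature-only check: a truncated or corrupt codestream still passes.
    return data.startswith(JP2_SIGNATURE)
```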
Where's the unification?
Where should we work together?
• Shared test corpora and a test framework:
  – Start with the OPF Format Corpus?
  – Pull other corpora in by reference:
    http://www.pdfa.org/2011/08/isartor-test-suite/ for PDF/A
  – A sustainable version of Dave Tarrant's REF?
  – Extend with bit-mashing to compare tools?
• Aim to coordinate more:
  – Make it clear where to go? (More about OfficeDDT.)
  – Consider merging projects?
  – Consider sharing underlying libraries?
  – Consider building Tika modules?
  – Please consider Apache Preflight as a base for PDF validation.
