•Andrew Jackson
•Web Archiving Technical Lead
•British Library
Unified Characterisation, Please
Slides given at the SPRUCEdp Unified Characterisation event.


  1. Unified Characterisation, Please
     • Andrew Jackson
     • Web Archiving Technical Lead
     • British Library
  2. The Practitioners Have Spoken…
     • Quality Assurance (of broken or potentially broken data):
        • Quality assurance, Bit rot, and Integrity
     • Appraisal and Assessment:
        • Appraisal and assessment, Conformance, Unknown characteristics, and Unknown file formats.
        • Identify/Locate Preservation Worthy Data
     • Identify Preservation Risks:
        • Obsolescence, preservation risk and business constraint
     • Long tail of many other issues:
        • Contextual and Data capture issues through to Embedded objects, and broader issues around Value and cost.
     • Plus: Sustainable Tools
  3. Appraisal and Assessment
     Conformance, Unknown characteristics, and Unknown file formats. Identify/Locate Preservation Worthy Data
     • Identification
        • Always used to 'route' data to software that can understand it.
        • Use minimum information to identify:
           • e.g. header only if possible. “Truncated PDF”, not “UNKNOWN”. GIS shapefiles: .shp, .shx, but with a missing .dbf should be reported as such.
     • Validation
        • Two modes needed: “Fast fail”, “Log and continue” /Quirks
        • Stop baseless distinction between “Well formed” and “Valid”
        • Validation is irrelevant to digital preservation assessment:
           • e.g. Effective “PDF/A”, without the 1.4 and XMP chunk.
           • We're on the wrong side of Postel's Law.
        • Unknown completeness and failure to future-proof:
           • e.g. JHOVE tries to validate versions of PDF it cannot know.
           • e.g. Tools sometimes interpret/migrate data opaquely.
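As a rough illustration of the identification and validation points on slide 3, the sketch below identifies a file from its header, reports a truncated PDF as such rather than as unknown, reports a shapefile set with a missing part, and runs checks in either "fast fail" or "log and continue" mode. All function names, returned labels and the two example checks are assumptions made for illustration; none of this reflects how JHOVE, DROID or the other tools in these slides actually work.

```python
from typing import Callable, List, Optional


def identify(data: bytes) -> str:
    """Identify from the header, then look at the tail for a PDF end marker."""
    if data.startswith(b"%PDF-"):
        # Assumption: a complete PDF carries %%EOF within its last 1024 bytes.
        if b"%%EOF" not in data[-1024:]:
            return "Truncated PDF"
        return "PDF"
    return "UNKNOWN"


def shapefile_report(names: List[str]) -> str:
    """Report a shapefile set, naming any missing companion files."""
    exts = {n.lower().rsplit(".", 1)[-1] for n in names if "." in n}
    if "shp" not in exts:
        return "UNKNOWN"
    missing = [e for e in ("shx", "dbf") if e not in exts]
    if missing:
        return "Shapefile (missing: " + ", ".join("." + e for e in missing) + ")"
    return "Shapefile"


# A check returns a problem message, or None when it finds nothing wrong.
Check = Callable[[bytes], Optional[str]]


def validate(data: bytes, checks: List[Check], fast_fail: bool) -> List[str]:
    """Run checks; stop at the first problem (fast fail) or collect them all."""
    problems: List[str] = []
    for check in checks:
        message = check(data)
        if message is not None:
            problems.append(message)
            if fast_fail:
                break
    return problems


def check_header(data: bytes) -> Optional[str]:
    return None if data.startswith(b"%PDF-") else "missing %PDF- header"


def check_trailer(data: bytes) -> Optional[str]:
    return None if b"%%EOF" in data[-1024:] else "missing %%EOF marker"
```

Calling `validate(data, [check_header, check_trailer], fast_fail=False)` returns every problem found, which suits the "log and continue" mode; passing `fast_fail=True` returns at most one.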
  6. Identify Preservation Risks
     Obsolescence, preservation risk and business constraint
     • Significant Properties are irrelevant here.
        • It's not really about the content, but about the context.
     • Dependency Analysis:
        • What software does this need?
        • Does this file use format features that are not well supported across implementations?
        • What other resources are transcluded?
           • Fonts? cf. OfficeDDT.
           • Remote embeds?
           • Embedded scripts that might mask dependencies?
        • Do some operations require a password?
           • e.g. JHOVE cannot spot 'harmless' PDF encryption.
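The dependency questions on slide 6 can be approximated very crudely for PDF by scanning the raw bytes for the name keys that usually signal encryption, scripts, external URI references and embedded font programs. This is only a sketch under strong assumptions: the function name and result keys are made up for illustration, and a byte scan will miss anything stored inside compressed object streams or written with escaped names, so a real tool would need a proper parser.

```python
import re
from typing import Dict

# PDF name keys used here as rough hints (assumed sufficient for illustration).
MARKERS = {
    "encryption": rb"/Encrypt\b",            # trailer entry for an encrypted file
    "scripts": rb"/(?:JavaScript|JS)\b",     # JavaScript actions
    "external_uris": rb"/URI\b",             # URI actions pointing outside the file
    "embedded_fonts": rb"/FontFile[23]?\b",  # embedded font program streams
}


def dependency_hints(data: bytes) -> Dict[str, bool]:
    """Return, per hint, whether its marker occurs anywhere in the raw bytes."""
    return {
        name: re.search(pattern, data) is not None
        for name, pattern in MARKERS.items()
    }
```

A result such as `{"encryption": True, ...}` says only that the file declares encryption, not whether a password is actually needed to read it, which is the distinction the last bullet of the slide is about.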
  7. Sustainable Tools
     Our Tools
     • Pure-Java Characterisation:
        • JHOVE ('clean room' implementation)
        • New Zealand Metadata Extractor (NZME)
        • Apache Tika
     • Java-based aggregation of various CLI tools:
        • JHOVE2
        • FITS
     • Other Characterisation:
        • XCL – C++/XML 'clean room' extended with ImageMagick
        • Many more, inc. forensics, BitCurator, OfficeDDT, jpylyzer...
     • Identification:
        • DROID, FIDO, Apache Tika, File
     • Visualisation:
        • C3PO, and many non-specialised tools.
  8. Sustainable Tools
     Up to date? Working together?
     • Software Dependency Management:
        • FITS/JHOVE2 embed old DROID versions, hard to upgrade.
        • Dead dependencies: FITS and FFIdent, NZME and Jflac.
        • Is FITS embedding JHOVE2, or is JHOVE2 embedding FITS?
        • Embed shared modules instead?
     • Software Project Management and Communication:
        • JHOVE, JHOVE2? FITS?
        • JHOVE2 only compiles on Sheila's branch?
        • Roadmaps, issue management, testing, C.I., etc.
        • Cross-project coordination and bug-fixing?
     • Complexity: JHOVE2, XCL, extremely complex
        • JHOVE2 Berkeley DB causes checksum failures in tests
        • Tika solves same problem using SAX
  9. Sustainable Tools
     Shared tests?
     • Separate projects arise from separate workflows
        • Start by understanding commonality and finding gaps?
        • Share test cases and compare results?
     • The OPF Format Corpus contains various valid and invalid files.
        • Built by practitioners to test real use cases.
        • e.g. JP2 features, PDF Cabinet of Horrors.
     • Do the tools give consistent and complementary results?
        • Let's find out!
     • cf. Dave Tarrant's REF for Identification:
        • http://data.openplanetsfoundation.org/ref/
        • http://data.openplanetsfoundation.org/ref/pdf/pdf_1.7/
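Slide 9's question of whether tools give consistent results over a shared corpus could be tested with a harness along these lines. It is a minimal sketch: each tool is modelled as a plain Python callable from bytes to a label, and the two stand-in identifiers are hypothetical, not wrappers around DROID, FIDO, Tika or File.

```python
from typing import Callable, Dict

# Assumption: every tool can be wrapped as a function from file bytes to a label.
Identifier = Callable[[bytes], str]


def compare_tools(
    corpus: Dict[str, bytes], tools: Dict[str, Identifier]
) -> Dict[str, Dict[str, str]]:
    """Return each tool's answer for every file on which the tools disagree."""
    disagreements: Dict[str, Dict[str, str]] = {}
    for filename, data in corpus.items():
        answers = {tool_name: tool(data) for tool_name, tool in tools.items()}
        if len(set(answers.values())) > 1:
            disagreements[filename] = answers
    return disagreements


def lenient_identifier(data: bytes) -> str:
    """Hypothetical tool that trusts the header alone."""
    return "PDF" if data.startswith(b"%PDF-") else "UNKNOWN"


def strict_identifier(data: bytes) -> str:
    """Hypothetical tool that also wants an end-of-file marker."""
    if data.startswith(b"%PDF-") and b"%%EOF" in data[-1024:]:
        return "PDF"
    return "UNKNOWN"
```

Running `compare_tools` over a corpus that mixes valid and deliberately broken files yields exactly the files worth discussing between projects: the ones where the tools give different answers.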
  10. Bit-mashing as Tool QA
      • Bitwise exploration of data sensitivity.
      • One way to compare tools.
      • Helps understand formats.
      • cf. Jay Gattuso's recent OPF blog.
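One plausible reading of "bitwise exploration of data sensitivity" is to flip each bit of a file in turn, re-run a tool on the damaged copy, and record whether its output changes. The sketch below does that for any tool wrapped as a Python callable; the function names are assumptions for illustration, and the approach described in the blog post mentioned above may differ.

```python
from typing import Callable, List


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with the single bit at bit_index inverted."""
    mutated = bytearray(data)
    mutated[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(mutated)


def bit_mash(data: bytes, tool: Callable[[bytes], str]) -> List[bool]:
    """For each bit, record True if flipping it changes the tool's output."""
    baseline = tool(data)
    sensitivity: List[bool] = []
    for i in range(len(data) * 8):
        try:
            result = tool(flip_bit(data, i))
        except Exception as exc:
            # A crash on damaged input counts as a changed result.
            result = "error: " + type(exc).__name__
        sensitivity.append(result != baseline)
    return sensitivity
```

Comparing the sensitivity maps of two tools over the same file shows which bytes each one actually inspects. The cost is one tool run per bit, so in practice this is only feasible on small files or a sampled subset of bit positions.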
  11. Quality Assurance (of broken or potentially broken data)
      Quality assurance, Bit rot, and Integrity
      • JHOVE let failed TIFF-JP2 through…
      • Jpylyzer does better.
      • Both fall far short of actual rendering.
  12. Where's the unification?
      Where should we work together?
      • Shared test corpora and test framework:
         • Start with the OPF Format Corpus?
         • Pull other corpora in by reference:
            • http://www.pdfa.org/2011/08/isartor-test-suite/ for PDF/A
         • Sustainable version of Dave Tarrant's REF?
         • Extend with bit-mashing to compare tools?
      • Aim to coordinate more:
         • Make it clear where to go? (More about OfficeDDT).
         • Consider merging projects?
         • Consider sharing underlying libraries?
         • Consider building Tika modules?
      • Please consider Apache Preflight as base for PDF validation.
