Cochrane von Suchodoletz File Creation, Rendering and Formats


File Creation, Rendering and Formats
Euan Cochrane and Dirk von Suchodoletz

Published in: Technology
  • Digital Preservation definition based on:
  • File format definition from Wikipedia:
  • On HTML docs, Ian Hickson: “There are literally dozens if not hundreds of billions of documents already on the Web. A study of a sample of several billion of those documents with a test implementation of the HTML5 Parser specification that I did at Google put a very conservative estimate of the fraction of those pages with markup errors at more than 78%. When I tweaked it a bit to look at a few more errors, the number was 93%. And those are only core syntax errors -- it didn't count misuse of HTML, like putting a <p> element inside an <ol> element. If we required browsers to refuse those documents, then you couldn't browse over 90% of the Web. But consider -- if one browser showed error messages on half the Web, and another browser showed no errors and instead showed the Web roughly as the author intended. Which browser would the average person use? If we want to make HTML5 successful, we have to make sure the browser vendors pay attention to it. Any requirements that make their market share go down relative to browsers who aren't following the spec will immediately be ignored.”
  • On MS Office 2007 & ODS:
  • On Office 2007 and 2010 writing Office 97-2003 files differently: This result was identified during internal research when developing the Office Dependency Discovery Tool. The differences involve content being added to the files in XML form that seems to be ignored by the earlier versions of Office but used in some way by the more recent versions.
  • JPEG: And here: LibreOffice and Visio: And here:
  • Rendering Matters Report available here:

    1. File Creation, Rendering and Formats
       Euan Cochrane, Archives New Zealand & Dirk von Suchodoletz, University of Freiburg
       Future Perfect 2012, 26 March 2012, Wellington, New Zealand
    2. Contents
       Euan
       • Files, formats and their relationships to creating applications
       • Files, formats and their relationships to rendering applications
       Dirk
       • Maintaining the ability to use older rendering applications
       Euan
       • Context and conclusions
    3. Digital Preservation
       • What is digital preservation?
         Maintaining the full information content of digital objects [across time]
         Maintaining the ability to render digital objects [across time]
         “The goal of digital preservation is the accurate rendering of authenticated content over time”
       • What is a file format?
         “[pre-defined/particular] way that information is encoded for storage in a computer file”
    4. File Creation and Formats
       • In 2007, over 90% of HTML documents did not conform to standards
       • Microsoft Office 2007 (and possibly 2010) create ODS files differently to most open source office suites.
       • Microsoft Office 2007 and 2010 create Microsoft Office 97-2003 formatted files differently to Microsoft Office 97-2003
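One way to look at this kind of variation is to compare the internal structure of two spreadsheets saved from the same content by different applications. An ODS file is a ZIP container, so the sketch below lists the members of each package and reports what differs. The function names are hypothetical, and a difference in the member list or in member content only hints at a different implementation; it says nothing about how either file renders.

```python
import hashlib
import zipfile


def member_digests(package_file):
    """Map each member name of a ZIP-based file (such as ODS) to a SHA-256 digest."""
    digests = {}
    with zipfile.ZipFile(package_file) as package:
        for name in package.namelist():
            digests[name] = hashlib.sha256(package.read(name)).hexdigest()
    return digests


def compare_packages(file_a, file_b):
    """Report members present in only one package and members whose content differs."""
    a = member_digests(file_a)
    b = member_digests(file_b)
    shared = set(a) & set(b)
    return {
        "only_in_a": sorted(set(a) - set(b)),
        "only_in_b": sorted(set(b) - set(a)),
        "content_differs": sorted(name for name in shared if a[name] != b[name]),
    }
```

Calling `compare_packages("from_suite_a.ods", "from_suite_b.ods")` (both file names are placeholders) returns three sorted lists of member names that can be inspected by hand.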
    5. Format Standards are Often Ambiguous or not Available
       • The JPEG standard specifies an end of image marker but not an end of file marker
         – Different apps write them differently
       • LibreOffice 3.5 (14 February 2012) now “supports” Visio file import. This support is based on reverse engineering, as the format standard is not publicly available. It is not complete
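The JPEG point can be checked on real files with a few lines of code. The sketch below, with a hypothetical function name, finds the last end of image marker (the bytes FF D9) and counts whatever an application wrote after it. It assumes the last FF D9 in the file is the true end of the image, which is a simplification: trailing data that itself contains those two bytes would be undercounted.

```python
SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


def trailing_bytes_after_eoi(data):
    """Return the number of bytes after the last EOI marker.

    Returns None when the data does not start with SOI or contains no EOI.
    """
    if not data.startswith(SOI):
        return None
    position = data.rfind(EOI)
    if position == -1:
        return None
    return len(data) - (position + len(EOI))
```

Running this over a collection of JPEG files written by different applications gives a quick view of how each one ends its files.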
    6. “Rendering Matters” Research
       • Compared the rendering of ~100 files on old software running on old hardware (the “control”) to:
         1. LibreOffice version 3.3.0
         2. Microsoft Office 2007
         3. WordPerfect Office X5
         4. Control software running on emulated hardware
    7. Summary Research Results
       • [The choice of] Rendering [Environment] Matters
       • MS-Office 2007 was a better rendering tool for the old files than either LibreOffice or WordPerfect Office
       • The use of particular attributes/features in office files is inconsistent but most are used at least once.
       • At least one “odd”/rare attribute/feature is included in most office files
    8. Original Environments (OE)
       • Original creating application best candidate to render documents properly
       • Proprietary format knowledge embedded in the application
       • One environment renders all objects of a certain type
       • Keeping original software (and hardware) environments has impact on preservation and access workflows
    9. Components of Access through OE
       • Emulators for different computer architectures
       • Software archive of all required applications, operating systems, additional components like fonts, codecs
       • Workflows on object ingest
       • Access systems for end users
    10. Emulators
       • Wide range available for all relevant computer architectures
       • Many Open Source
       • Not yet DP aware – long term availability to be secured
       • DP community should seek more influence
    11. Software Archive
       • Preserve the relevant software components and operational knowledge
    12. Necessary Workflows
       • Freiburg digital preservation group leads the state-sponsored two-year bwFLA project
       • bwFLA project providing access to complex, interactive digital objects
       • Provide extended ingest workflows with feedback loop
    13. Extended Ingest Workflow
       • Make use of donors' expertise to collect complete information and components
         – Extend software archive if necessary
         – Add necessary technical metadata
         – Record knowledge on object handling
       • Let the donor check and sign off the rendering results
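The ingest steps on this slide can be pictured as a small record that is filled in with the donor and checked before an object is released for access. This is a minimal sketch under stated assumptions: the class, its field names, and the two helper functions are hypothetical and are not taken from the bwFLA project.

```python
from dataclasses import dataclass, field


@dataclass
class IngestRecord:
    """Information collected with the donor during ingest (hypothetical fields)."""
    object_id: str
    required_software: list = field(default_factory=list)
    technical_metadata: dict = field(default_factory=dict)
    handling_notes: list = field(default_factory=list)
    donor_signed_off: bool = False


def missing_from_archive(record, software_archive):
    """List required software that the software archive does not yet hold."""
    return [name for name in record.required_software if name not in software_archive]


def ingest_complete(record, software_archive):
    """Ingest is complete once nothing is missing and the donor has signed off."""
    return record.donor_signed_off and not missing_from_archive(record, software_archive)
```

The feedback loop corresponds to repeating the check: anything returned by `missing_from_archive` is added to the archive, the rendering is shown to the donor again, and `donor_signed_off` is set only after the donor accepts the result.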
    14. Access Workflows
       • Provide a reading room system or extension
         – Pre-configure emulator to the OE required by the object
         – Prepare the inclusion of the object into the original environment
         – Automate the startup of the OSE
         – Provide the user information and hints on how to interact with the OE & automate parts of this
         – (Dis)allow, to a certain degree, saving results from the original environments or capturing certain states (e.g. using screenshots)
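The pre-configuration and startup steps above can be sketched as a function that assembles an emulator command line. QEMU is used here purely as an example; the slides do not name a specific emulator. The options `-m`, `-hda`, `-cdrom` and `-snapshot` are standard QEMU options, while the function name, parameter names and default memory size are assumptions made for illustration. The command is only built, not executed.

```python
def build_emulator_command(disk_image, object_image, memory_mb=64, allow_saving=False):
    """Assemble a QEMU command line for one original environment and one object.

    disk_image   -- disk image holding the preconfigured original environment
    object_image -- ISO image carrying the digital object into the environment
    allow_saving -- if False, add -snapshot so changes never reach the disk image
    """
    command = [
        "qemu-system-i386",
        "-m", str(memory_mb),      # memory size in megabytes
        "-hda", disk_image,        # original environment as first hard disk
        "-cdrom", object_image,    # object presented as a CD-ROM
    ]
    if not allow_saving:
        command.append("-snapshot")  # write to temporary files instead of the image
    return command
```

A reading room system could pass the returned list to `subprocess.run` to start the environment. Leaving `allow_saving` at `False` is one way to disallow saving results back into the preserved environment.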
    15. Access System
       • Many components already exist, developed by past DP projects
       • Next step: Make them a usable “product”
    16. Reading Room Access System
       • Make emulation accessible to standard users, such as those in memory institutions
       • Robust platform, extension to standard reading room systems
       • Unified access to a wide range of different emulators + preconfigured environments
    17. Context and Conclusions
       • Making decisions about preservation strategies
       • When to Normalise?
       • Variation in format implementation doesn’t matter if you maintain a compatible rendering environment
       • Variation in rendering across environments doesn’t matter if you maintain the “right” rendering environment
       • There are practical options for maintaining rendering environments
    18. Thank you