e-Services to Keep Your Digital Files Current


These slides were presented on April 1st, 2010 at Archives 2, College Park, Washington DC

Published in: Technology

  1. 1. e-Services to Keep Your Digital Files Current • Presented by: Peter Bajcsy - Research Scientist at NCSA - Associate Director of I-CHASS, I3 Institute - Adjunct Assistant Professor, CS & ECE, UIUC • National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
  2. 2. Acknowledgement • This research was partially supported by a National Archives and Records Administration (NARA) supplement to NSF PACI cooperative agreement CA #SCI-9619019 and NCSA Industrial Partners. • The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the National Archives and Records Administration, or the U.S. government. • Contributions by: Peter Bajcsy, Kenton McHenry, Rob Kooper, Michal Ondrejcek, Jason Kastner, William McFadden, Sang-Chul Lee, Luigi Marini • Imaginations unbound
  3. 3. Outline • Introduction • Technologies • File format conversion software registry • Automated file format conversions • Conversion quality assessment • Summary • Future Work
  4. 4. Introduction
  5. 5. Supporting NARA’s Strategic Plan • According to The Strategic Plan of The National Archives and Records Administration 2006–2016, “Preserving the Past to Protect the Future”: • “Strategic Goal: We will preserve and process records to ensure access by the public as soon as legally possible” • “Part D. We will improve the efficiency with which we manage our holdings from the time they are scheduled through accessioning, processing, storage, preservation, and public use.”
  6. 6. To Preserve or Not To Preserve? [Diagram: digital representation of information & knowledge; information transfer from AGENCY to ARCHIVES; preservation?]
  7. 7. Do We Know the Answers? • (1) What is the granularity of information that one should preserve about a decision process in order to reconstruct it? • Example: the granularity of information collected from a decision process based on visual inspection of images has implications for storage and computational requirements/costs - ImageProvenance2Learn (IP2Learn)
  8. 8. Do We Know the Answers? • (2) Given thousands of DVDs with files, which files are related? • Example: given files that contain 2D scans of blueprints and 3D CAD models, find the content-based file correspondence - File2Learn prototype system [Figure: Relationship Discovery, 30 files, 784 files]
  9. 9. Do We Know the Answers? • (3) Given hundreds of versions of the ‘same’ file, which file version(s) are similar and which one(s) should be preserved? • Example: given a collection of Adobe PDF documents, compare all pairs of Adobe PDF documents containing text, images, vector graphics,… and order them chronologically or based on similarities - Doc2Learn prototype
  10. 10. Do We Know the Answers? • (4) Given thousands of file formats, which conversion software and which target file format should be used so that the content of those thousands of files remains viewable in the long run? • The focus of today’s talk is on examples of technologies that would provide answers to (4) at a large processing scale with computational scalability.
  11. 11. Goal • Observation: File format conversions are inevitably one part of our daily life • Question: Can file format conversions assist in making digital content created today accessible and viewable throughout its lifecycle? • Consideration: we do not know what file formats will be around 100+ years down the road • Goal: to make files backward and forward compatible
  12. 12. Background on File Format Conversions • A very large number of file formats in which digital content is stored. • An increasing number of complex file formats containing multiple types of digital content (e.g., Adobe PDF, HDF) or having very elaborate specifications (e.g., STEP). • Many software implementations of import (read) and export (write) operations. • A wide spectrum of quality of software implementations when reading and storing content in various file formats. • Ephemeral support for many file formats and software implementations • Hardware dependency of many software implementations
  13. 13. Illustration of 3D File Format Reality • *.ma, *.mb, *.mp, *.k3d, *.pdf (*.prc, *.u3d), *.w3d, *.lwo, *.c4d, *.dwg, *.blend, *.iam, *.max, *.3ds
  14. 14. Challenges and Objective • Challenges: • The quality of file format conversions is unknown when using a particular software to do the conversion • The volume of file format conversions requires significant computational resources • Understanding information loss due to file format conversions is application dependent • Estimating information loss is complicated due to the complexity of file formats • The file format, software and hardware dependencies are often unknown • Objective: Design and prototype services using a computational cloud to support forward-looking decisions
  15. 15. Parameters of File Format Conversions • File format: Content representation depends on a file format • Software: Retrieval and storage of content in a file format depends on the quality of software implementation • Hardware: Software execution depends on access to storage media, operating system, and hardware platform • Criteria defining information loss: Information loss due to file format conversions is defined by application specific criteria
  16. 16. Three Example Services of Interest • (a) Find file format conversion software to convert from any file format to any other file format • (b) Execute file format conversions with any available third party software • (c) Evaluate information loss due to file format conversion over a set of files in multiple complex file formats
  17. 17. Technologies
  18. 18. Overview
  19. 19. #1: Conversion Software Registry (CSR) • Problem: Find file format conversion software to convert from any file format to any other file format • Technology: Conversion Software Registry (CSR) at https://isda.ncsa.uiuc.edu/NARA/CSR/ • Features: Support for searching, editing and adding information about file format conversion software, open access and login-based modification
  20. 20. Movie of CSR
  21. 21. Comparison of CSR with Other Systems • File Format Registries: PRONOM, developed by the National Archives of the United Kingdom; Unified Digital Formats Registry (UDFR, formerly GDFR) • Software Registries/Catalogues, community specific: The Geotechnical and Geoenvironmental Software Directory (GGSD); The Natural Language Software Registry (NLSR) • Software Registries/Catalogues, business oriented: The Bit9 Global Software Registry (whitelisting software); Cnet (available software with links to feature descriptions) • File Format Conversion Registries: The Planets test bed (password protected, 18 software packages)
  22. 22. Novelty of Conversion Software Registry • Existing file format registries focus on file format specifications • Catalogues of software focus on software of interest to a specific community and include information about top level description, vendors and price but not capabilities to import and export file formats • A file format conversion registry like Planets.org supports 16 software packages, only single-hop conversion paths and couples software to the registry. • Novelty: CSR provides answers about multi-hop conversion paths from 70+ software packages currently [Figure: two-hop conversion path]
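
A multi-hop lookup of the kind described on this slide can be pictured as a graph search: file formats are nodes and each registered import/export pair is an edge labeled with the software that performs it. The sketch below is only an illustration under that assumption; the registry rows, tool names, and the function `find_conversion_path` are hypothetical and do not reflect the actual CSR data model or query interface.

```python
from collections import deque

# Hypothetical registry rows: (software, input format, output format).
# These entries are illustrative only and are not taken from the CSR.
REGISTRY = [
    ("ToolA", "wrl", "obj"),
    ("ToolB", "obj", "stp"),
    ("ToolC", "wrl", "stl"),
]


def find_conversion_path(registry, source, target):
    """Return the shortest list of (software, input, output) hops, or None."""
    # Index the rows by input format so each expansion step is a dict lookup.
    by_input = {}
    for software, fmt_in, fmt_out in registry:
        by_input.setdefault(fmt_in, []).append((software, fmt_in, fmt_out))

    # Breadth-first search finds a path with the fewest hops first.
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        fmt, path = queue.popleft()
        if fmt == target:
            return path
        for hop in by_input.get(fmt, []):
            if hop[2] not in visited:
                visited.add(hop[2])
                queue.append((hop[2], path + [hop]))
    return None


if __name__ == "__main__":
    # With the toy rows above this is a two-hop path: wrl -> obj -> stp.
    print(find_conversion_path(REGISTRY, "wrl", "stp"))
```

Breadth-first search is used here because fewer hops usually means fewer opportunities for information loss; a real registry could instead weight edges by measured conversion quality.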
  23. 23. #2: File Format Conversion Engine • Problem: Execute file format conversions with any available third party software • Technology: Polyglot version 1, operating on NCSA hardware resources, downloadable for private deployment • Features: web-based access to a computational cloud consisting of commodity hardware and installations of third party software with import/export capabilities
  24. 24. Movie of Polyglot
  25. 25. Polyglot Design [Diagram labels: Extensibility, Automation, Computational Scalability, Cloud Computing, Services to Archivists]
  26. 26. Comparison of File Format Conversion Systems • Some existing file format conversion services • http://www.ps2pdf.com • Supports only certain conversion types • http://www.zamzar.com • Supports conversion of document, image, music, video and a couple of CAD formats • http://media-convert.com • Supports about 20 multi-media formats • Drawbacks: The existing systems are not extensible (limited by specific libraries), cannot be downloaded for private use (files with sensitive info), and their computational scalability is unknown
  27. 27. Format Conversion Extensibility Via Software Reuse • Observation: Nobody has the resources to load every possible file format • Fully supporting the many available formats is an enormous undertaking • If a file format is closed/proprietary it may be difficult to retrieve the data directly from the file • Vendor file formats sometimes store application-feature-specific pieces of information that are not supported in other formats • Most software supports importing/exporting of a subset of application domain specific file formats. • Conclusion: Software reuse and extensibility are the key characteristics of file format conversion systems
  28. 28. File Format Conversion Extensibility • Extensibility in Polyglot: Software is reused by wrapping 3rd party software while utilizing whatever access the software vendors make available to embedded functionality • published Application Programming Interface (API), command line and Graphical User Interfaces (GUI) • Novelty: Polyglot provides a single user interface that allows the user to execute multiple conversion software applications automatically, and over distributed computers that have a license for the software needed to do the conversion and/or have the computing resources necessary for the size of the job (computational scalability).
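
To make the idea of wrapping third party software concrete, here is a minimal sketch for the command-line case only. It assumes each wrapped program can be described by the formats it reads, the formats it writes, and a command template; the class `WrappedConverter`, the function `pick_converter`, and the program name `exampleconv` are hypothetical and are not part of Polyglot.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class WrappedConverter:
    """A third party program described by what it reads, writes, and how to call it."""
    name: str
    inputs: frozenset
    outputs: frozenset
    command_template: str  # hypothetical template with {src} and {dst} placeholders

    def supports(self, fmt_in, fmt_out):
        return fmt_in in self.inputs and fmt_out in self.outputs

    def build_command(self, src, dst):
        """Return the argument list that could be handed to subprocess.run."""
        # Split the template first, then fill in the paths, so that file
        # names containing spaces remain a single argument.
        parts = shlex.split(self.command_template)
        return [part.format(src=src, dst=dst) for part in parts]


def pick_converter(converters, fmt_in, fmt_out):
    """Return the first wrapped program able to do the conversion, or None."""
    for converter in converters:
        if converter.supports(fmt_in, fmt_out):
            return converter
    return None


if __name__ == "__main__":
    tool = WrappedConverter(
        name="exampleconv",  # hypothetical program name
        inputs=frozenset({"wrl"}),
        outputs=frozenset({"stl", "stp"}),
        command_template="exampleconv --in {src} --out {dst}",
    )
    chosen = pick_converter([tool], "wrl", "stl")
    print(chosen.build_command("heart model.wrl", "heart.stl"))
```

The sketch only builds the command; actually running it, and driving software that exposes nothing but a GUI, would need additional machinery that is outside the scope of this illustration.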
  29. 29. #3: File Comparison Engines • Problem: Compare two files and evaluate information loss due to file format conversion over a set of files in multiple complex file formats • Technologies: • Initial prototypes: ModelBrowser (four 3D comparison metrics); Doc2Learn (one metric across multiple digital objects); Doc2LearnHadoop (computational scalability using Hadoop) • Work-in-progress: A general API for content-based comparison of any two files - Versus
  30. 30. 3D Comparison Example (ModelBrowser) • Software: Adobe 3D Reviewer • Original File: WRL • Converted Files: STP, STL, IGS, U3D • Comparison Method: Light Fields [Chen, 2003] compares silhouettes from various viewing angles around the objects • Conclusion: Information loss(WRL→STP) = Information loss(WRL→STL) [Figure: heart.wrl, heart.stl, heart.stp]
  31. 31. Multiple Object Comparisons (Doc2Learn) Adobe PDF documents ~ {text, images, vector graphics, ….}
  32. 32. Multiple Method Comparisons (Versus) • Software: MS Paint • Original File: TIF • Converted Files: PNG, GIF, JPG, BMP • Comparison Method: Pixel by pixel difference (sum of Euclidean distances over all pixels) • Conclusion 1: Information loss(TIF→BMP or TIF→PNG) = 0 • Conclusion 2: Information loss(TIF→GIF) > Information loss(TIF→JPG) [Figure: user inputs]
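
The comparison method named on this slide, a sum of Euclidean distances over all pixels, can be written in a few lines. The sketch below assumes both images have already been decoded to the same size and are held as nested lists of (R, G, B) tuples; the function name `pixel_difference` is hypothetical and this is not the Versus API.

```python
import math


def pixel_difference(image_a, image_b):
    """Sum of Euclidean distances between corresponding pixels.

    Each image is a list of rows; each pixel is an (R, G, B) tuple.
    """
    same_height = len(image_a) == len(image_b)
    if not same_height or any(len(ra) != len(rb) for ra, rb in zip(image_a, image_b)):
        raise ValueError("images must have the same dimensions")

    total = 0.0
    for row_a, row_b in zip(image_a, image_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            # Distance between two colors treated as points in RGB space.
            total += math.dist(pixel_a, pixel_b)
    return total


if __name__ == "__main__":
    original = [[(0, 0, 0), (255, 255, 255)]]
    lossless = [[(0, 0, 0), (255, 255, 255)]]
    lossy = [[(3, 4, 0), (255, 255, 255)]]
    print(pixel_difference(original, lossless))  # identical pixels give 0.0
    print(pixel_difference(original, lossy))     # one pixel moved by distance 5.0
```

A score of zero corresponds to the lossless case in Conclusion 1; larger scores indicate that more pixel content changed during conversion.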
  33. 33. Information Loss Evaluation • Setup: • Inputs: a set of files, a set of software packages, criteria for defining information loss • Wanted output: information loss ‘score’ per file format conversion • Approach: • Phase I: Find all round-trip conversion paths from a given file format to the same file format • Phase II: Execute all conversions to obtain converted files. • Phase III: Compare the original and converted files
  34. 34. Information Loss Evaluation: Computational Requirements • Files: one file in STP file format • Software: Adobe 3D Reviewer, Cyberware PlyTool • Comparison Method: Light Fields [Chen, 2003] • Number of paths: 10 (28 individual conversions) [Chart: computational requirements of Phase I: Find, Phase II: Execute, Phase III: Compare]
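
The three phases above can be sketched as follows. Phase I is shown as a depth-first enumeration of paths that start at a format and return to it; Phases II and III are reduced to caller-supplied functions. All names here (`round_trip_paths`, `evaluate`, the `max_hops` limit, and the toy conversion list) are assumptions made for illustration, not the authors' implementation.

```python
def round_trip_paths(conversions, start, max_hops=4):
    """Phase I: list conversion paths that leave `start` and return to it.

    `conversions` is a list of (software, input format, output format).
    Intermediate formats are not revisited within a single path.
    """
    paths = []

    def extend(fmt, path, seen):
        for hop in conversions:
            _software, fmt_in, fmt_out = hop
            if fmt_in != fmt:
                continue
            if fmt_out == start:
                paths.append(path + [hop])  # the round trip is closed
            elif fmt_out not in seen and len(path) + 1 < max_hops:
                extend(fmt_out, path + [hop], seen | {fmt_out})

    extend(start, [], {start})
    return paths


def evaluate(paths, original, convert, compare):
    """Phases II and III: run each path, then score the result against the original."""
    scores = []
    for path in paths:
        current = original
        for hop in path:
            current = convert(hop, current)  # Phase II: execute one conversion
        scores.append((path, compare(original, current)))  # Phase III: compare
    return scores


if __name__ == "__main__":
    # Hypothetical capabilities of two tools; not measured data.
    toy = [
        ("ToolA", "stp", "stl"),
        ("ToolA", "stl", "stp"),
        ("ToolB", "stl", "ply"),
        ("ToolB", "ply", "stp"),
    ]
    for path in round_trip_paths(toy, "stp"):
        print(" -> ".join([path[0][1]] + [hop[2] for hop in path]))
```

Because every path ends in the original format, the original and the final file can be compared with a single format-specific metric, which is why round trips are convenient for scoring.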
  35. 35. Summary
  36. 36. Information Technology Lessons • Better understanding of preservation and reconstruction of electronic records in terms of file format conversions • The data model needed for documenting existing file format conversion software • A framework (test bed) for software reuse and extensibility to provide file format conversion services • The complexity of performing content-based file comparison and measurements of information loss due to file format conversions • The computational cost of file format conversions, file comparisons and information loss evaluations • The computational scalability of file format conversions and file comparisons using parallel processing paradigms
  37. 37. The Value for Archivists • Prototype services are freely available to the digital preservation community and provide decision support tools • to select an ‘optimal’ file format to be preserved • to evaluate file format conversion software • to select minimum cost for a chosen file format conversion path • The framework for conversion software documentation, software reuse and functionality extensibility has a major impact on • Efficiency with which we manage our holdings • Understanding of the information loss introduced due to conversions • The cost of updating file format conversion services
  38. 38. Development Plans • Prototype services are open to the public at • https://isda.ncsa.uiuc.edu/NARA/CSR/ • http://teeve3.ncsa.uiuc.edu/polyglot/convert.php • Software is open source technology and downloadable from http://isda.ncsa.uiuc.edu/download/ • We have been building a second generation of these file format conversion services • Feedback is very welcome • Questions: Peter Bajcsy, pbajcsy@ncsa.uiuc.edu