To Preserve Or Not To Preserve?


The Challenges in Appraising Electronic Records

Published in: Technology, Education

To Preserve Or Not To Preserve?

  1. 1. To Preserve Or Not To Preserve? The Challenges in Appraising Electronic Records. Peter Bajcsy, PhD - Research Scientist, NCSA - Adjunct Assistant Professor, ECE & CS at UIUC - Associate Director, Center for Humanities, Social Sciences and Arts (CHASS), Illinois Informatics Institute (I3), UIUC. National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign. Date: January 21st, 2009
  2. 2. Acknowledgement • This research was partially supported by a National Archives and Records Administration (NARA) supplement to NSF PACI cooperative agreement CA #SCI-9619019 and NCSA Industrial Partners. • The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the National Archives and Records Administration, or the U.S. government. • Contributions by: Peter Bajcsy, Kenton McHenry, Rob Kooper, Michal Ondrejcek, William McFadden, Sang-Chul Lee, David Clutter and Alex Yahja Imaginations unbound
  3. 3. Outline • Introduction • Stakeholders • Conceptual Challenges • Some Open Problems • Research Examples Illustrating Open Problems • Summary, Observations and Future Vision
  4. 4. Introduction • Two Trends in the Context of Decision Processes (Government, Medical, Natural Disasters, …) • Decision processes are moving from paper-based to electronic-record-based (~ computer-assisted decision processes) • Electronic records depend on rapidly changing information technology • Optimal decisions depend on knowledge • Any learning from electronic records depends on preservation and reconstruction of the records, as well as on quality and granularity of the information
  5. 5. Fundamental Problems • Limited learning from historical records today • It is often due to missing information and high uncertainty/low quality of historical records. • Lack of understanding of how to preserve and reconstruct data and decision processes. • It is due to insufficient forecasting/simulation capabilities.
  6. 6. To Be Preserved! (Diagram labels: digital representation of information & knowledge; preservation; information transfer; agency; archives)
  7. 7. Motivation • The problems related to preservation of electronic records are only going to become more serious • Information becomes more heterogeneous and complex • More data types • Higher dimensional data • New file formats • Volumes of electronic records have been increasing and will continue to grow • The model of a paperless office (4 years of Bush’s email > 8 years of Clinton’s email) • The paradigm shift to eScience • Digital information technology has been changing faster than any previous preservation media • The time scale of electronic media is ephemeral in comparison with paper or clay tablets
  8. 8. Example of Preservation Needs in Medicine • Short term: • Medical practice requires comparing patients’ records acquired today with the patients’ records from 5, 10, 50 or 70 years ago in order to assess functional, structural or low level biological changes due to diseases, treatments and/or aging. • Long term: • Genealogy studies compare data sets over several hundreds and thousands of years
  9. 9. Who Are the Stakeholders? • Multiple institutions and organizations are active in the area of medical record preservation • National Library of Medicine (NLM) • Research Information Network (RIN) • Medical Research Council (MRC) in UK • National Archives and Records Administration (NARA) • Identified common goals: • Seamless, uninterrupted access to expanding collections of biomedical data, medical knowledge, and health information • Preserve medical record collections in highly usable forms and contribute to comprehensive strategies for preservation of biomedical information in the U.S. and worldwide.
  10. 10. Other Stakeholders • Government agencies • Prediction of patterns signaling natural disasters based on historical measurements • Detection of terrorist attacks based on past experience • Learning about other planets from past space shuttle missions • Preservation of cultural heritage • Companies • Preservation of engineering drawings and architectural designs – Boeing, John Deere, GM • Preservation of simulation results – Caterpillar, Ford • Backward compatibility of hardware/software - GE
  11. 11. NARA as One of the Key Stakeholders • According to The Strategic Plan of The National Archives and Records Administration 2006–2016, “Preserving the Past to Protect the Future”: • “Strategic Goal: We will preserve and process records to ensure access by the public as soon as legally possible” • “D. We will improve the efficiency with which we manage our holdings from the time they are scheduled through accessioning, processing, storage, preservation, and public use.”
  12. 12. Conceptual Challenges • Learning Requires Reusing Electronic Records • How to enable and support preservation and reconstruction of electronic records? • Advancing Sensors and Instruments Leads to New Types of High Dimensional Data and Large Volumes • How to design preservation methodologies that scale well? • Process to Enable Learning over Time from Electronic Records Requires Large Financial Investments • How to minimize computational hardware, software and storage cost and maximize the amount of preserved information?
  13. 13. What Are The Key Open Problems?
  14. 14. Some Open Problems -> Intellectual Merit • Appraisal Methodology • Appraisal by Visual Exploration • Support of Appraisals by Enabling Comparisons • Scalability of Appraisals with Increasing Heterogeneity of Information, Dimensionality of Data and Volume of Electronic Records • Support of Archival Decisions • Simulate Preservation Costs as a Function of Information Granularity and Information Technology • Optimal Utilization of Computational and Human Resources • Automation of Processing for Preservation • Discovery of Relationships Among Electronic Records • Information Preserving Conversions of Electronic Records • Sampling, Authenticity and Integrity Verification of a Collection of Temporally Changing Records
  15. 15. Broader Impacts (Diagram labels: Process to Enable Learning Over Time; Electronic Records; Knowledge; Optimal Decision Making; +$; -$)
  16. 16. Concrete Research Examples Illustrating Open Problems
  17. 17. Open Problems Related to Appraisal Methodology 1. Appraisal by Visual Exploration 2. Support of Appraisals by Enabling Comparisons 3. Scalability of Appraisals with Increasing Heterogeneity of Information, Dimensionality of Data and Volume of Electronic Records
  18. 18. Definition of Appraisal in Archival Context • Appraisal -- the process of determining the value and thus the final disposition of Federal records, making them either temporary or permanent. • See mgmt/initiatives/appraisal.html • The basis of appraisal decisions may include • the records' provenance and content, • the records' authenticity and reliability, • the records' order and completeness, • the records' condition and costs to preserve them, and • the records' intrinsic value
  19. 19. Open Problem 1: Appraisal by Visual Exploration • How to visualize the transition from raw data to information? • Raw data (byte stream) -> Information: 0F0 -> (R,G,B) -> GREEN • How to encode and represent heterogeneous information for visual exploration and for computer-assisted operations? • Encoding (e.g., shape consisting of a set of Bezier curves is encoded by a set of straight lines) • Representation (e.g., colors are represented by an ordered sequence of intensity values from all bands) • How to summarize representations for visual exploration? • Frequency of occurrence of primitives • Local and global summarizations
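The raw-data-to-information step named on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: it reads `0F0` as a three-digit hexadecimal shorthand with one digit per color band, and the function names and the small color-name table are hypothetical, not part of the presented tools.

```python
def decode_rgb_shorthand(raw: str) -> tuple:
    """Interpret three hex digits such as '0F0' as an (R, G, B) triple in 0..255.

    Assumption: one hex digit per band, expanded by repeating the digit.
    """
    if len(raw) != 3:
        raise ValueError("expected exactly three hex digits")
    # Each digit d expands to the byte dd, e.g. 'F' -> 0xFF = 255.
    return tuple(int(digit * 2, 16) for digit in raw)


# Hypothetical lookup table; a real viewer would need a fuller vocabulary.
COLOR_NAMES = {
    (255, 0, 0): "RED",
    (0, 255, 0): "GREEN",
    (0, 0, 255): "BLUE",
}


def name_color(rgb: tuple) -> str:
    """Map an (R, G, B) triple to a human-readable name, if one is known."""
    return COLOR_NAMES.get(rgb, "UNNAMED")


if __name__ == "__main__":
    rgb = decode_rgb_shorthand("0F0")
    print(rgb, "->", name_color(rgb))
```

The same bytes carry no meaning until an encoding (hex digits per band) and a representation (ordered band intensities) are fixed, which is why both have to be preserved along with the record.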
  20. 20. Example: Adobe Portable Document Format (PDF) • Why PDF? - PDF is just an example of a container • Office environment (Adobe PDF, PS, MS Word, HTML, …) • Satellite measurements (HDF, netCDF, …) • (Figure labels: 3D; Movie; Adobe Library 6.0; Adobe Library 7.0)
  21. 21. Exploration of PDF Documents Using PDF Viewer • PDF Viewer presents information as a set of pages with their layouts • PDF Viewer renders layers of internal objects (components) and hence only the top layer is visible
  22. 22. Needed Exploration of PDF Components • There is no support for archival appraisals that would include visual exploration of components in a document (a container of components) • Needed: viewers for appraisal analyses that present information stored in a container (e.g., PDF) as a set of components and their characteristics • Text – word frequency • Images (rasters) – color frequency (histogram) • Vector graphics – line frequency • Exploration for appraisal analyses needs to include visible and invisible objects
  23. 23. Exploration of Text Components LOADED FILES Occurrence of words Occurrence of numbers “Ignore” words
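The text summary listed on this slide (occurrence of words, occurrence of numbers, and a list of "ignore" words) can be sketched as follows. It assumes the text has already been extracted from the container; the tokenization rule, the stop list and the function name are illustrative assumptions, not the behavior of the presented viewer.

```python
import re
from collections import Counter

# Hypothetical "ignore" list; an appraiser would tune this per collection.
IGNORE_WORDS = {"the", "a", "of", "and", "to", "in"}


def summarize_text(text, ignore=IGNORE_WORDS):
    """Return (word frequencies, number frequencies) for extracted text."""
    # Tokens are either runs of letters or simple decimal numbers.
    tokens = re.findall(r"[A-Za-z]+|\d+(?:\.\d+)?", text.lower())
    words = Counter(t for t in tokens if t[0].isalpha() and t not in ignore)
    numbers = Counter(t for t in tokens if t[0].isdigit())
    return words, numbers


if __name__ == "__main__":
    words, numbers = summarize_text("The shuttle and the shuttle crew in 2003")
    print(words.most_common(3))
    print(numbers.most_common(3))
```

Color and line frequencies for image and vector components would follow the same pattern, with pixels or line primitives in place of tokens.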
  24. 24. Exploration of Image Components LOADED FILES “Ignore” colors List of images Occurrence of colors Preview
  25. 25. Exploration of Vector Graphics Components LOADED FILES Preview Occurrence of v/h lines
  26. 26. Exploration of Visible And Invisible Objects Objects intersected at the mouse click location
  27. 27. Open Problem 2: Support of Appraisals by Enabling Comparisons • How to compare containers with heterogeneous information? • Methodology • Metrics • Weighting factors for fusion • How to quantify differences between the same type of information? • Encodings and Representations • Metrics • Local versus global differences
  28. 28. Comparisons
  29. 29. Methodology Partial solutions in literature -Ref. +… CAPTCHA Open problems +… Relationship to Permanent Records
  30. 30. Experimental Example INPUT = 10 PDF docs (4 & 6 Groups) UNIQUE ID= 1,2,3,4 UNIQUE ID= 5,6,7,8,9,10
  31. 31. Comparative Experimental Results INPUT = 10 PDF docs (6 & 4 members in each Group) Vector-based similarity Text-based similarity Image-based similarity
  32. 32. Comparative Experimental Results (Panel titles: Vector Graphics Similarity; Word Similarity; Portion of Document Surface Allotted to Each Document Feature; Combined Comparison Using Combination of Document Features in Proportion to Coverage)
  33. 33. Accuracy Comparisons • Average similarity by method (one refers to high similarity & zero refers to low similarity) • TEXT ONLY: Group 1 = 1, Group 2 = 0.489, across Groups 1 & 2 = 0 • TEXT & IMAGE & GRAPHICS: Group 1 = 0.906, Group 2 = 0.520, across Groups 1 & 2 = 0.075 • Conclusions: • Differences in similarity are up to 10% of the score • Documents in Group 2 would likely be misclassified, as a similarity of 0.5 would be the threshold between similar and dissimilar documents
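The comparison slides combine text, image and vector graphics similarities in proportion to the document surface each feature covers. A minimal sketch of that idea follows. The slides do not state which metric was used, so cosine similarity over frequency counts is an assumption here, as are the function names and the example coverage values.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two frequency summaries (1 = same, 0 = disjoint)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def fused_similarity(similarities: dict, coverage: dict) -> float:
    """Weight each per-feature similarity by its share of document surface."""
    total = sum(coverage[name] for name in similarities)
    if total == 0:
        return 0.0
    return sum(similarities[n] * coverage[n] for n in similarities) / total


if __name__ == "__main__":
    text_sim = cosine_similarity(Counter(wing=2, foam=1), Counter(wing=1, foam=1))
    # Hypothetical per-feature similarities and coverage fractions.
    sims = {"text": text_sim, "image": 0.5, "vector": 0.0}
    cover = {"text": 0.5, "image": 0.25, "vector": 0.25}
    print(round(fused_similarity(sims, cover), 3))
```

Weighting by coverage keeps a feature that occupies a small part of the page from dominating the combined score, which is one way to read the "weighting factors for fusion" question.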
  34. 34. Open Problem 3: Scalability of Appraisals • Scalability of appraisals with increasing heterogeneity of information, dimensionality of data and volume of electronic records • How should the appraisal process change as 3D data is added to file containers? • How should the appraisal process change as 3D+time, 2D+spectrum, 3D+time+spectrum, nD, … data are added? • How should appraisal operations be designed to accommodate the growing volume of electronic records?
  35. 35. Approaches to Computational Scalability of Document Appraisals • Options for parallel processing • message-passing interface (MPI) • MPI is designed for the coordination of a program running as multiple processes in a distributed memory environment by passing control messages. • open multi-processing (OpenMP) • OpenMP is intended for shared memory machines. It uses a multithreading approach where the master thread forks any number of slave threads. • Map Reduce parallel programming paradigm for commodity clusters • It lets programmers write a simple Map function and Reduce function, which are then automatically parallelized without requiring the programmers to code the details of parallel processes and communications • Specialized Hardware: FPGA, Cell processors, GPU
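The Map Reduce option above can be illustrated with a word count over extracted document text. The sketch below is plain Python with a sequential driver standing in for the cluster framework; it is not the Hadoop API, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum all counts emitted for one word."""
    return word, sum(counts)


def run_mapreduce(documents):
    """Sequential stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)  # the "shuffle" step
    return dict(reduce_fn(key, values) for key, values in groups.items())


if __name__ == "__main__":
    docs = {"a.pdf": "wing foam wing", "b.pdf": "foam"}
    print(run_mapreduce(docs))
```

Only `map_fn` and `reduce_fn` are specific to the appraisal task. Because each call depends on a single document or a single key, a framework can distribute the calls across machines without changes to either function.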
  36. 36. Computational Requirements for Executing the Methodology Yellow indicates computations Relationship to Permanent Records Appraisal & Sampling
  37. 37. Hardware & Software Dependencies with Hadoop • Test data: 15 PDF files from the Columbia investigation web site at • Software configuration: Linux OS (Ubuntu flavor) and the Hadoop implementation of Map and Reduce functionalities • Hardware configuration: homogeneous & heterogeneous machines • (Chart: Hadoop Average Speed; y-axis: average speed in seconds, 0 to 60; x-axis: #machines, 1 to 5; series: Homogeneous Hardware, Heterogeneous Hardware)
  38. 38. Open Problems Related to Archival Decisions • Simulate Preservation Costs as a Function of Information Granularity and Information Technology • Optimal Utilization of Computational and Human Resources
  39. 39. Open Problem: Archival Decision Support • Decision support for forecasting preservation costs • How to predict computational and storage requirements of preservation as a function of technology variables and information granularity? • How to optimize computational hardware, software, storage, and networking investments?
  40. 40. Basic Questions About Information to be Preserved
  41. 41. Challenges in Forecasting • Volatility of software/hardware/storage media • Updates: Windows operating systems since 2000: Two major new releases, two minor service pack updates, around fifty security patches since SP2 • Upgrades: Microsoft Office Pro for Windows 95/98/ME/2000/XP/2003/2007 • Media life expectancy: Optical ~5 years Disk ~ 15 years Microfiche ~ 5 years, years, 100, microfilm ~ 300, newspaper ~ 50, clay tablet ~ 10,000 (life expectancy vs. information density – [P. Conway, 1996]) • Cost of software/hardware/storage media • Operating System: Windows 3.1/95/98/NT/2000/XP/Vista: Windows 95 = $209; Windows NT = $280; Windows XP = $300; Windows Vista = $399 -> $319 (2008) • 128 MB of SDRAM: Year 1999 ~ $120 -> $40 -> $200-250 due to earthquake in Taiwan -> March 2000 ~ $55 -> March 2007 ~ $8.96 (flash card) - (1TB ~ $109.95 as of 01/15/2009) • High performance computers: 2006: DARPA awards approximately $500 million to Cray and IBM; 2007: NSF $200 million to NCSA/IBM
  42. 42. Archival Decision Support • Lack of forecasting models to predict preservation costs • Our work: Understand the tradeoffs between information value and computational/storage costs by providing simulation frameworks • Information granularity, organization, compression, encryption, document format, ... • Versus • Cost of CPU for gathering information, for processing and for input/output operations; cost of storage media, upgrades, storage room, … • Prototype simulation framework: Image Provenance To Learn available for downloading from
  43. 43. Simulation Framework (Diagram labels: Decision Maker; Information Gathering and Storage; Information Retrieval and Process Reconstruction; Preservation; Learning; Provenance Information; Value linear; Value observed; Cost (memory, CPU); Cost / Information Granularity Analysis; Image Viewer; Process Reconstruction System; Information Gathering System)
  44. 44. Image Event Category Tracker Events Summary of Events Viewed Area Storage Time
  45. 45. Information Granularity
  46. 46. Storage vs. Information Organization Tradeoffs: Test Case • Information granules include interpreted, raw and snapshots • Files were not compressed • (Chart: saved size in bytes, log scale from 1 to 10,000,000, per event name, such as Change Auto Zoom, Add Annotation, Mouse Clicked, New Image and Window Created, for the RDF and Key Pair metadata models; RDF = Resource Description Framework, key pair = XML)
  47. 47. Open Problems Related to Automating Archival Processing for Preservation 1. Discovery of Relationships Among Electronic Records 2. Information Preserving Conversions of Electronic Records 3. Sampling, Authenticity and Integrity Verification of a Collection of Temporally Changing Records
  48. 48. Open Problem 1: Discovering Relationships Among Files • How should one establish relationships among electronic records coming from disparate sources or from the same source at multiple time instances? • How to extract metadata? • What ontology to use to represent the extracted metadata? • How to automate metadata extraction from multiple data types, e.g., 2D drawings and 3D CAD models? • How to discover relationships between electronic records corresponding to the same physical objects but different multidimensional observations? • Need to Understand the Complexity of the Problem
  49. 49. Metadata Extraction: Complexity & Size • The Crandon Mine Reports from 1981 till 2003: idx?type=browse&scope=ECONATRES.CRANDONMINE • RDF triples extracted using Aperture and visualized using RDF-Gravity (red – edges, green – literal values, violet – properties)
  50. 50. Relationships Among Multiple Data Types • Example Data: Torpedo Weapon Retriever 841 • 784 existing 2D image drawings and N>22 3D CAD models • How to establish relationships among the 3D CAD models and 2D image drawings during a product lifecycle? • (Figure: Hypothetical Distribution of 3D CAD models for TWR 841)
  51. 51. Understanding Challenges in Automation (Diagram labels: Relationship Discovery; OCR; Descriptors (metadata); Representation)
  52. 52. Open Problem 2: Conversions of Electronic Records • Conversions of electronic records are needed because • Visual exploration depends on various software packages • Many formats are retired (deprecated) over time • A subset of formats is selected for preservation purposes • How to measure the degree of information preservation when files are converted from format A to format B? • During conversions, information could be lost, added or modified • What is the importance of each byte, object, etc.? • How to introduce a framework for measuring the quality of conversion and visualization software?
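One possible way to quantify what a conversion lost or added is to summarize a file before and after as counts of primitives and compare the two summaries. The sketch below is only an illustration of that idea: the primitive names, the function name and the numbers are hypothetical, and it treats every primitive as equally important, which the slide explicitly leaves as an open question.

```python
from collections import Counter


def conversion_report(before: Counter, after: Counter) -> dict:
    """Compare primitive counts of a file before and after a conversion."""
    kept = before & after   # multiset intersection: primitives that survived
    lost = before - after   # present before, missing afterwards
    added = after - before  # introduced by the conversion
    total = sum(before.values())
    return {
        "preserved_fraction": sum(kept.values()) / total if total else 1.0,
        "lost": dict(lost),
        "added": dict(added),
    }


if __name__ == "__main__":
    # Hypothetical summaries of a 3D model before and after a round trip.
    original = Counter(vertex=100, face=50, material=2)
    converted = Counter(vertex=100, face=50, normal=100)
    print(conversion_report(original, converted))
```

A per-primitive weight could replace the uniform counting once the importance of each byte or object has been decided for a given collection.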
  53. 53. Example: Conversion of X3D to STEP to X3D Software: X3dToVrml97 X3D Software: WRL A3D Reviewer Software: A3D Reviewer Software: Nothing! Vrml97ToX3d STEP WRL X3D
  54. 54. Automation of 3D File Format Mapping & Conversion
  55. 55. Open Problem 3: Sampling, Integrity and Authenticity • Given finite resources and increasing amounts of electronic records, automation of sampling, integrity and authenticity verification is very much needed • What are the criteria for sampling a collection of temporally changing versions of ‘the same’ document? • Authenticity • Integrity • Information content • How to measure a degree of authenticity? • Computers might assign inaccurate time stamps to records • How to detect integrity failures? • A record containing a female patient with prostate cancer • How to incorporate constraints into sampling? • Storage space, compression, computational cost, etc.
  56. 56. Example: Temporal Ranking and Integrity Verification • Chronological ranking based on time stamps of files • Last modification (current implementation) • Ranking can be changed by a human • Content referring to dates can be used for integrity verification • (Figure axis: TIME)
  57. 57. Rules and Attributes for Integrity Verification • Document integrity attributes? • appearance or disappearance of document images • appearance and disappearance of dates embedded in documents • file size • count of image groups • number of sentences • average value of dates found in a document • Rules?
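The temporal ranking and the use of embedded dates for integrity verification can be sketched together. The rule used here, that a document should not mention a date later than its own last-modification time stamp, is a hypothetical example of a rule and not one stated in the slides; the ISO date pattern, record layout and function names are also assumptions.

```python
import re
from datetime import date

# Assumption: dates appear in ISO form (YYYY-MM-DD) in the extracted text.
DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def embedded_dates(text):
    """Return all valid ISO dates mentioned in the text."""
    found = []
    for year, month, day in DATE_RE.findall(text):
        try:
            found.append(date(int(year), int(month), int(day)))
        except ValueError:
            continue  # skip impossible dates such as 2003-13-40
    return found


def rank_and_check(records):
    """Rank (name, last_modified, text) records chronologically and flag
    those whose content refers to a date after the file's time stamp."""
    ranked = sorted(records, key=lambda record: record[1])
    flagged = [
        name
        for name, modified, text in ranked
        if any(d > modified for d in embedded_dates(text))
    ]
    return [record[0] for record in ranked], flagged


if __name__ == "__main__":
    versions = [
        ("v2.pdf", date(2003, 3, 1), "Report dated 2003-02-15"),
        ("v1.pdf", date(2003, 1, 10), "Meeting held on 2003-02-01"),
    ]
    order, suspicious = rank_and_check(versions)
    print(order, suspicious)
```

A flagged record is only a candidate for human review. As the earlier slide notes, the time stamp itself may be the inaccurate part, which is also why the ranking can be changed by a human.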
  58. 58. Summary • Introduced a set of open problems related to • Appraisal of electronic records • Archival forecasting of preservation costs • Automation of processing for preservation • Examples used for illustrating the open problems from our research just scratch the surface of some of the open problems
  59. 59. Observations • Many stakeholders are already aware of some of the open problems, including government agencies and companies • As all government agencies have been computerized, the continuity and functioning of the agencies depend on preservation and reconstruction of electronic records • Right now, we are at the beginning of the exponential growth of electronic records (many more electronic records will be coming) • Some scientific fields are already facing real time decisions about preserving electronic records (e.g., astronomers)
  60. 60. Future Vision • It is envisioned that the preservation and reconstruction of electronic records have to follow different paradigms that incorporate • Scalability (heterogeneity, dimensionality and volume) • Forecasting of preservation costs • New level of automation and quality control in processing for preservation purposes • The field of electronic record management and preservation needs forward looking solutions to stay abreast with the dynamics of digital information
  61. 61. References to Presented Research • Bajcsy P., R. Kooper and S-C. Lee, “Understanding Preservation and Reconstruction Requirements for Computer Assisted Decision Processes,” ACM Journal on Computers and Cultural Heritage (JOCCH), (submitted October 2008). • Bajcsy P., “A Perspective on Cyberinfrastructure for Water Research Driven by Informatics Methodologies,” Geography Compass, Volume 2, Issue 6 (p 2040-2061), 2008 Blackwell Publishing Ltd, URL: bin/fulltext/121478978/PDFSTART • Bajcsy P., R. Kooper, L. Marini and J. Myers, “Community-Scale Cyberinfrastructure for Exploratory Science,” In: Cyberinfrastructure Technologies and Applications book, Editor: Junwei Cao, Nova Science Publishers, Inc., Chapter 12, 2009. • McHenry K. and P. Bajcsy, “An Overview of 3D Data Content, File Formats and Viewers,” Technical Report NCSA-ISDA08-002, October 31, 2008. • McFadden W., K. McHenry, R. Kooper, M. Ondrejcek, A. Yahja and P. Bajcsy, “Advanced Information Systems for Archival Appraisals of Contemporary Documents,” the 4th IEEE International Conference on e-Science, December 8-12, 2008, Indianapolis, IN. • Lee S-C, W. McFadden and P. Bajcsy, “Text, Image and Vector Graphics Based Appraisal of Contemporary Documents,” The Seventh International Conference on Machine Learning and Applications, December 11-13, 2008, San Diego, CA. • Bajcsy P. and S-C Lee, “Computer Assisted Appraisal of Contemporary PDF Documents,” ARCHIVES 2008: Archival R/Evolution & Identities, 72nd Annual Meeting Pre-conference Programs: August 24-27, 2008, San Francisco, CA. • Lee S-C. and P. Bajcsy, “Understanding Challenges in Preserving and Reconstructing Computer-Assisted Medical Decision Processes,” the Workshop on Machine Learning in Biomedicine and Bioinformatics (MLBB07) of the 2007 International Conference on Machine Learning and Application (ICMLA07), Cincinnati, Ohio, December 13-15, 2007.
• Bajcsy P. and D. Clutter, “Gathering and Analyzing Information about Decision Making Processes Using Geospatial Electronic Records,” the 2006 Winter Federation of Earth Science Information Partners (“Federation”) Conference, poster, January 4-6, 2006 in Washington, DC.
  62. 62. Questions • Project URL: and • Publications – see our URL at http://isda.ncsa.uiuc.edu/publications • Peter Bajcsy; email: pbajcsy@ncsa.uiuc.edu