This document presents the challenges of preserving electronic records for future use. It outlines stakeholders, such as government agencies and medical organizations, with an interest in electronic record preservation. Key open problems include how to appraise and process growing volumes of heterogeneous electronic records over time, given limited computational resources and budgets. Research examples illustrate these problems, such as developing scalable appraisal methodologies and automating parts of electronic record processing and preservation to enable long-term learning from the records.
To Preserve Or Not To Preserve?
1. To Preserve Or Not To Preserve?
The Challenges in Appraising Electronic Records
Peter Bajcsy, PhD
- Research Scientist, NCSA
- Adjunct Assistant Professor, ECE & CS, UIUC
- Associate Director, Center for Humanities, Social Sciences and Arts (CHASS), Illinois Informatics Institute (I3), UIUC
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
Date: January 21st, 2009
2. Acknowledgement
• This research was partially supported by a National Archives and Records Administration (NARA) supplement to NSF PACI cooperative agreement CA #SCI-9619019 and NCSA Industrial Partners.
• The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the National Archives and Records Administration, or the U.S. government.
• Contributions by: Peter Bajcsy, Kenton McHenry, Rob Kooper, Michal Ondrejcek, William McFadden, Sang-Chul Lee, David Clutter and Alex Yahja
3. Outline
• Introduction
• Stakeholders
• Conceptual Challenges
• Some Open Problems
• Research Examples Illustrating Open Problems
• Summary, Observations and Future Vision
4. Introduction
• Two Trends in the Context of Decision Processes (Government, Medical, Natural Disasters, …)
• Decision processes are moving from paper-based to electronic-record-based (~ computer-assisted decision processes)
• Electronic records depend on rapidly changing information technology
• The optimality of decisions depends on the available knowledge
• Any learning from electronic records depends on preservation and reconstruction of the records, as well as on the quality and granularity of the information
5. Fundamental Problems
• Limited learning from historical records today
• It is often due to missing information and high uncertainty/low quality of historical records.
• Lack of understanding of how to preserve and reconstruct data and decision processes.
• It is due to insufficient forecasting/simulation capabilities.
6. To Be Preserved!
[Diagram: information transfer from AGENCY to ARCHIVES, with the digital representation of information & knowledge as the object of preservation]
7. Motivation
• The problems related to preservation of electronic records are only going to become more serious
• Information becomes more heterogeneous and complex
• More data types
• Higher-dimensional data
• New file formats
• Volumes of electronic records have been increasing and will continue to grow
• The model of a paperless office (4 years of Bush’s email > 8 years of Clinton’s email)
• The paradigm shift to eScience
• Digital information technology has been changing faster than any previous preservation medium
• The time scale of electronic media is ephemeral in comparison with paper or clay tablets
8. Example of Preservation Needs in Medicine
• Short term:
• Medical practice requires comparing patients’ records acquired today with the patients’ records from 5, 10, 50 or 70 years ago in order to assess functional, structural or low-level biological changes due to diseases, treatments and/or aging.
• Long term:
• Genealogy studies compare data sets over several hundreds and thousands of years
9. Who Are the Stakeholders?
• Multiple institutions and organizations are active in the area of medical record preservation
• National Library of Medicine (NLM)
• Research Information Network (RIN)
• Medical Research Council (MRC) in the UK
• National Archives and Records Administration (NARA)
• Identified common goals:
• Seamless, uninterrupted access to expanding collections of biomedical data, medical knowledge, and health information
• Preserve medical record collections in highly usable forms and contribute to comprehensive strategies for preservation of biomedical information in the U.S. and worldwide.
10. Other Stakeholders
• Government agencies
• Prediction of patterns signaling natural disasters based on historical measurements
• Detection of terrorist attacks based on past experience
• Learning about other planets from past space shuttle missions
• Preservation of cultural heritage
• Companies
• Preservation of engineering drawings and architectural designs – Boeing, John Deere, GM
• Preservation of simulation results – Caterpillar, Ford
• Backward compatibility of hardware/software – GE
11. NARA as One of the Key Stakeholders
• According to The Strategic Plan of The National Archives and Records Administration 2006–2016, “Preserving the Past to Protect the Future”:
• “Strategic Goal: We will preserve and process records to ensure access by the public as soon as legally possible”
• “D. We will improve the efficiency with which we manage our holdings from the time they are scheduled through accessioning, processing, storage, preservation, and public use.”
12. Conceptual Challenges
• Learning Requires Reusing Electronic Records
• How to enable and support preservation and reconstruction of electronic records?
• Advancing Sensors and Instruments Leads to New Types of High-Dimensional Data and Large Volumes
• How to design preservation methodologies that scale well?
• Process to Enable Learning over Time from Electronic Records Requires Large Financial Investments
• How to minimize computational hardware, software, and storage cost and maximize the amount of preserved information?
13. What Are The Key Open Problems?
14. Some Open Problems -> Intellectual Merit
• Appraisal Methodology
• Appraisal by Visual Exploration
• Support of Appraisals by Enabling Comparisons
• Scalability of Appraisals with Increasing Heterogeneity of Information, Dimensionality of Data and Volume of Electronic Records
• Support of Archival Decisions
• Simulate Preservation Costs as a Function of Information Granularity and Information Technology
• Optimal Utilization of Computational and Human Resources
• Automation of Processing for Preservation
• Discovery of Relationships Among Electronic Records
• Information-Preserving Conversions of Electronic Records
• Sampling, Authenticity and Integrity Verification of a Collection of Temporally Changing Records
15. Broader Impacts
[Diagram: a process to enable learning over time converts Electronic Records into Knowledge that supports optimal decision making, with costs (-$) and benefits (+$)]
17. Open Problems Related to Appraisal Methodology
1. Appraisal by Visual Exploration
2. Support of Appraisals by Enabling Comparisons
3. Scalability of Appraisals with Increasing Heterogeneity of Information, Dimensionality of Data and Volume of Electronic Records
18. Definition of Appraisal in Archival Context
• Appraisal -- the process of determining the value, and thus the final disposition, of Federal records, making them either temporary or permanent.
• See http://www.archives.gov/records-mgmt/initiatives/appraisal.html
• The basis of appraisal decisions may include
• the records' provenance and content,
• the records' authenticity and reliability,
• the records' order and completeness,
• the records' condition and costs to preserve them, and
• the records' intrinsic value
19. Open Problem 1: Appraisal by Visual Exploration
• How to visualize the transition from raw data to information?
• Raw data (byte stream) -> Information: 0F0 -> (R,G,B) -> GREEN
• How to encode and represent heterogeneous information for visual exploration and for computer-assisted operations?
• Encoding (e.g., a shape consisting of a set of Bezier curves is encoded by a set of straight lines; see the sketch after this slide)
• Representation (e.g., colors are represented by an ordered sequence of intensity values from all bands)
• How to summarize representations for visual exploration?
• Frequency of occurrence of primitives
• Local and global summarizations
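To make the encoding example concrete, here is a minimal Python sketch (not from the presentation; the control points and segment count are arbitrary) of approximating a cubic Bezier curve by a set of straight line segments:

```python
# A minimal sketch of the slide's encoding example: approximate a cubic
# Bezier curve by straight line segments via uniform parameter sampling.
def bezier_point(p0, p1, p2, p3, t):
    # Bernstein form of a cubic Bezier curve at parameter t in [0, 1]
    u = 1 - t
    x = u**3*p0[0] + 3*u**2*t*p1[0] + 3*u*t**2*p2[0] + t**3*p3[0]
    y = u**3*p0[1] + 3*u**2*t*p1[1] + 3*u*t**2*p2[1] + t**3*p3[1]
    return (x, y)

def bezier_to_lines(p0, p1, p2, p3, n=8):
    pts = [bezier_point(p0, p1, p2, p3, i / n) for i in range(n + 1)]
    return list(zip(pts, pts[1:]))  # n straight segments approximating the curve

segments = bezier_to_lines((0, 0), (1, 2), (3, 2), (4, 0))
print(len(segments), "line segments; first:", segments[0])
```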
20. Example: Adobe Portable Document Format (PDF)
• Why PDF? - PDF is just an example of a container
• Office environment (Adobe PDF, PS, MS Word, HTML, …)
• Satellite measurements (HDF, netCDF, …)
[Figure: embedded content types by PDF version: 3D (Adobe Library 6.0), Movie (Adobe Library 7.0)]
21. Exploration of PDF Documents Using PDF Viewer
• PDF Viewer presents information as a set of pages with their layouts
• PDF Viewer renders layers of internal objects (components) and hence only the top layer is visible
22. Needed Exploration of PDF Components
• There is no support for archival appraisals that would include visual exploration of components in a document (a container of components)
• Viewers are needed for appraisal analyses that present information stored in a container (e.g., PDF) as a set of components and their characteristics (a rough sketch follows this slide)
• Text – word frequency
• Images (rasters) – color frequency (histogram)
• Vector graphics – line frequency
• Exploration for appraisal analyses needs to include visible and invisible objects
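As a rough illustration of the component census such viewers would build, the sketch below tallies word frequencies and per-page image counts in a PDF. It assumes the PyMuPDF library; the actual NCSA appraisal tools are not shown in the deck.

```python
# A minimal sketch, assuming PyMuPDF (pip install pymupdf). It illustrates
# summarizing a container as per-component frequency statistics; it is not
# the presentation's own viewer.
from collections import Counter
import fitz  # PyMuPDF

def component_census(path):
    doc = fitz.open(path)
    words, images = Counter(), Counter()
    for page in doc:
        # Text component: word frequency
        for w in page.get_text("text").split():
            words[w.lower().strip(".,;:()")] += 1
        # Image component: number of embedded raster images on each page
        images[page.number] = len(page.get_images(full=True))
    return words, images

if __name__ == "__main__":
    words, images = component_census("report.pdf")  # hypothetical input file
    print("Most frequent words:", words.most_common(10))
    print("Images per page:", dict(images))
```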
23. Exploration of Text Components
[Screenshot: text-component viewer with panels for loaded files, occurrence of words, occurrence of numbers, and “ignore” words]
24. Exploration of Image Components
[Screenshot: image-component viewer with panels for loaded files, list of images, occurrence of colors, “ignore” colors, and a preview]
25. Exploration of Vector Graphics Components
[Screenshot: vector-graphics viewer with panels for loaded files, a preview, and occurrence of vertical/horizontal lines]
26. Exploration of Visible And Invisible Objects
[Screenshot: viewer listing the objects intersected at the mouse-click location]
27. Open Problem 2: Support of Appraisals by Enabling Comparisons
• How to compare containers with heterogeneous information?
• Methodology
• Metrics
• Weighting factors for fusion (see the sketch after this slide)
• How to quantify differences between the same type of information?
• Encodings and Representations
• Metrics
• Local versus global differences
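A minimal sketch of similarity fusion across heterogeneous component types. The cosine metric and the example weights are illustrative assumptions, not the presentation's actual choices; it only assumes each container has already been summarized as per-type frequency histograms.

```python
# Fused container similarity: per-type histogram similarity combined with
# weighting factors. Metric and weights are illustrative assumptions.
import math

def cosine(h1, h2):
    keys = set(h1) | set(h2)
    dot = sum(h1.get(k, 0) * h2.get(k, 0) for k in keys)
    n1 = math.sqrt(sum(v * v for v in h1.values()))
    n2 = math.sqrt(sum(v * v for v in h2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def fused_similarity(doc_a, doc_b, weights):
    # doc_*  : dict mapping component type -> frequency histogram (dict)
    # weights: dict mapping component type -> weighting factor, summing to 1
    return sum(w * cosine(doc_a[t], doc_b[t]) for t, w in weights.items())

a = {"text": {"records": 4, "appraisal": 2}, "image": {"green": 9}}
b = {"text": {"records": 3, "archive": 1},  "image": {"green": 5, "red": 2}}
print(fused_similarity(a, b, {"text": 0.6, "image": 0.4}))
```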
30. Experimental Example
INPUT = 10 PDF docs (4 & 6 Groups)
UNIQUE ID= 1,2,3,4 UNIQUE ID= 5,6,7,8,9,10
Imaginations unbound
31. Comparative Experimental Results
INPUT = 10 PDF docs (4 and 6 members in the two groups)
[Figure: pairwise similarity matrices for the ten documents: text-based, image-based, and vector-based similarity]
32. Comparative Experimental Results
[Figure: (left) vector graphics similarity and word similarity
combined; (right) portion of document surface allotted to each
document feature. The comparison uses a combination of document
features in proportion to coverage.]
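A minimal sketch of the coverage-weighted fusion idea in this slide; the similarity scores and coverage fractions below are hypothetical, not the experiment's values.

    def fused_similarity(similarities, coverage):
        # Weight each per-feature similarity by the surface fraction it covers.
        total = sum(coverage.values())
        return sum(similarities[f] * coverage[f] / total
                   for f in similarities) if total else 0.0

    sims = {"text": 0.90, "image": 0.40, "vector": 0.75}    # hypothetical scores
    cover = {"text": 0.60, "image": 0.25, "vector": 0.15}   # fraction of page area
    print(fused_similarity(sims, cover))                    # ~0.75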
33. Accuracy Comparisons

Method                   Avg. Similarity  Avg. Similarity  Avg. Similarity
                         of Group 1       of Group 2       Across Groups 1 & 2
TEXT ONLY                1                0.489            0
TEXT & IMAGE & GRAPHICS  0.906            0.520            0.075

One refers to high similarity & zero refers to low similarity

Conclusions:
• Differences in similarity are up to 10% of the score
• Documents in Group 2 would likely be misclassified, as 0.5
similarity would be the threshold between similar and
dissimilar documents
34. Open Problem 3: Scalability of
Appraisals
• Scalability of appraisals with increasing
heterogeneity of information,
dimensionality of data and volume of
electronic records
• How should the appraisal process change
as 3D data is added to file containers?
• How should the appraisal process change
as 3D+time, 2D+spectrum,
3D+time+spectrum, nD, … data are added?
• How should appraisal operations be
designed to accommodate growing
volume of electronic records?
35. Approaches to Computational Scalability of
Document Appraisals
• Options for parallel processing
• message-passing interface (MPI)
• MPI is designed for the coordination of a program running as
multiple processes in a distributed memory environment by
passing control messages.
• open multi-processing (OpenMP)
• OpenMP is intended for shared memory machines. It uses a
multithreading approach where the master thread forks any
number of slave threads.
• MapReduce parallel programming paradigm for commodity
clusters
• It lets programmers write a simple Map function and a Reduce
function, which are then automatically parallelized without
requiring the programmers to code the details of parallel
processes and communications (a minimal sketch follows this list)
• Specialized hardware: FPGA, Cell processors, GPU
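A minimal sketch of the Map/Reduce paradigm using only Python's standard library (not Hadoop): the programmer supplies a Map function and a Reduce step, and the process pool handles the parallel Map calls.

    from collections import Counter
    from multiprocessing import Pool

    def map_count(text):
        # Map: one document -> its partial word counts.
        return Counter(text.lower().split())

    def reduce_counts(partials):
        # Reduce: merge the partial counts into a global histogram.
        total = Counter()
        for p in partials:
            total.update(p)
        return total

    if __name__ == "__main__":
        docs = ["records appraisal", "electronic records", "appraisal of records"]
        with Pool() as pool:
            print(reduce_counts(pool.map(map_count, docs)))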
37. Hardware & Software Dependencies with
Hadoop
• Test data: 15 PDF files from the Columbia investigation
web site at http://caib.nasa.gov/.
• Software configuration: Linux OS (Ubuntu flavor) and
the Hadoop implementation of Map and Reduce
functionalities
• Hardware configuration: homogeneous &
heterogeneous machines
[Chart: Hadoop average speed in seconds versus number of
machines (1-5), for homogeneous and heterogeneous hardware]
38. Open Problems Related to Archival
Decisions
• Simulate Preservation Costs as a Function of Information
Granularity and Information Technology
• Optimal Utilization of Computational and Human
Resources
39. Open Problem: Archival Decision Support
• Decision support for forecasting preservation
costs
• How to predict computational and storage
requirements of preservation as a function
of technology variables and information
granularity?
• How to optimize computational hardware,
software, storage, and networking
investments?
40. Basic Questions About Information to be
Preserved
41. Challenges in Forecasting
• Volatility of software/hardware/storage media
• Updates: Windows operating systems since 2000: two major new
releases, two minor service pack updates, around fifty security
patches since SP2
• Upgrades: Microsoft Office Pro for Windows
95/98/ME/2000/XP/2003/2007
• Media life expectancy: optical ~5 years, disk ~15 years,
microfiche ~100, microfilm ~300, newspaper ~50, clay tablet
~10,000 (life expectancy vs. information density – [P. Conway, 1996])
• Cost of software/hardware/storage media
• Operating system: Windows 3.1/95/98/NT/2000/XP/Vista: Windows
95 = $209; Windows NT = $280; Windows XP = $300; Windows Vista =
$399 -> $319 (2008)
• 128 MB of SDRAM: year 1999 ~ $120 -> $40 -> $200-250 due to
earthquake in Taiwan -> March 2000 ~ $55 -> March 2007 ~ $8.96
(flash card) – www.pricewatch.com (1 TB ~ $109.95 as of 01/15/2009)
• High performance computers: 2006: DARPA awards approximately
$500 million to Cray and IBM; 2007: NSF $200 million to NCSA/IBM
42. Archival Decision Support
• Lack of forecasting models to predict preservation costs
• Our work: understand the tradeoffs between information
value and computational/storage costs by providing
simulation frameworks (a toy sketch follows this slide)
• Information granularity, organization, compression, encryption,
document format, ...
• versus
• Cost of CPU for gathering information, for processing and for
input/output operations; cost of storage media, upgrades, storage
room, …
• Prototype simulation framework: Image Provenance To
Learn, available for download from
http://isda.ncsa.uiuc.edu
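A toy sketch, in the spirit of the simulation frameworks mentioned above, of forecasting media cost under periodic migration: every parameter (media lifetime, annual price decline, collection growth) is a hypothetical assumption, not a measured value.

    def media_cost(years=20, tb=10.0, price_per_tb=100.0,
                   price_decline=0.25, media_life=5, growth=0.30):
        # Re-buy media for the whole (growing) collection every media_life years.
        total = 0.0
        for year in range(years):
            if year % media_life == 0:
                total += tb * price_per_tb
            price_per_tb *= 1 - price_decline   # media gets cheaper every year
            tb *= 1 + growth                    # the collection keeps growing
        return total

    print(f"20-year media cost: ${media_cost():,.0f}")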
43. Simulation Framework
[Diagram: simulation framework – a decision maker connects an
information gathering and storage process (image viewer,
information gathering system) with an information retrieval and
learning process (process reconstruction system); provenance
information is exchanged between the two, and a cost /
information-granularity analysis plots preservation value
(linear vs. observed) against cost (memory, CPU)]
46. Storage vs. Information Organization
Tradeoffs: Test Case
• Information granules include interpreted, raw and snapshots
• Files were not compressed
[Chart: saved size in bytes (log scale, 1 to 10,000,000) per
event type – Change Auto Zoom, Change Gray Scale, Change RGB
Band, Add Annotation, Mouse Clicked, Mouse Clicked -
Magnification, Change Selection, Window Hidden, Change Gamma,
Window Shown, New Image, Change Visible Region, Change Zoom
Factor, Window Created – compared for the RDF (Resource
Description Framework) metadata model and the key-pair (XML)
metadata model]
47. Open Problems Related to Automating
Archival Processing for Preservation
1. Discovery of Relationships Among Electronic Records
2. Information Preserving Conversions of Electronic Records
3. Sampling, Authenticity and Integrity Verification of a Collection
of Temporally Changing Records
48. Open Problem 1: Discovering
Relationships Among Files
• How should one establish relationships among electronic
records coming from disparate sources or from the same
source at multiple time instances?
• How to extract metadata?
• What ontology to use to represent the extracted
metadata?
• How to automate metadata extraction from multiple data
types, e.g., 2D drawings and 3D CAD models?
• How to discover relationships between electronic records
corresponding to the same physical objects but different
multidimensional observations?
• Need to Understand the Complexity of the Problem
49. Metadata Extraction: Complexity & Size
The Crandon Mine Reports from 1981 till 2003
http://digicoll.library.wisc.edu/cgi-bin/EcoNatRes/EcoNatRes-idx?type=browse&scope=ECONATRES.CRANDONMINE
[Figure: RDF triples extracted using Aperture and visualized
using RDF-Gravity (red – edges, green – literal values,
violet – properties)]
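A minimal sketch, assuming the rdflib library, of representing extracted metadata as RDF triples as in the Crandon Mine example above; the namespace and property names are illustrative, not the ontology actually used there.

    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/records/")   # hypothetical namespace
    g = Graph()

    doc = EX["report-1981-001"]                     # hypothetical record ID
    g.add((doc, EX.fileFormat, Literal("application/pdf")))
    g.add((doc, EX.year, Literal(1981)))
    g.add((doc, EX.partOf, EX["CrandonMineReports"]))

    print(g.serialize(format="turtle"))             # triples ready for visualization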
50. Relationships Among Multiple Data Types
• Example Data: Torpedo Weapon Retriever 841
• 784 existing 2D image drawings and N>22 3D CAD
models
• How to establish relationships among the 3D
CAD models and 2D image drawings during a
product lifecycle?
[Figure: hypothetical distribution of 3D CAD models for TWR 841]
51. Understanding Challenges in Automation
[Diagram: relationship discovery based on OCR, descriptors
(metadata), and representation]
52. Open Problem 2: Conversions of
Electronic Records
• Conversions of electronic records are needed because
• Visual exploration depends on various software
packages
• Many formats are retired (deprecated) over time
• A subset of formats is selected for preservation
purposes
• How to measure the degree of information
preservation when files are converted from format A to
format B? (a round-trip sketch follows this slide)
• During conversions, information could be lost, added
or modified
• What is the importance of each byte, object, etc.?
• How to introduce a framework for measuring the
quality of conversion and visualization software?
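A minimal sketch of one way to quantify the degree of information preservation asked about above: round-trip the file from format A to B and back (as in the X3D -> STEP -> X3D example on the next slide) and count which elements survive. Here "elements" are just named objects and the example values are hypothetical; a real measure would weight each byte or object by importance.

    def preservation_ratio(original_elements, roundtrip_elements):
        # Fraction of the original's named elements that survive A -> B -> A.
        original, survived = set(original_elements), set(roundtrip_elements)
        return len(original & survived) / len(original) if original else 1.0

    before = ["Shape", "BezierCurve", "Material", "Transform"]   # hypothetical
    after = ["Shape", "IndexedLineSet", "Transform"]             # curve re-encoded
    print(f"{preservation_ratio(before, after):.0%} of elements preserved")  # 50%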
53. Example: Conversion of X3D to STEP to X3D
[Diagram: conversion chain between the X3D, WRL and STEP formats,
with the software used at each step (X3dToVrml97, A3D Reviewer,
Vrml97ToX3d) and one step for which no software exists
("Nothing!")]
54. Automation of 3D File Format Mapping &
Conversion
55. Open Problem 3: Sampling,
Integrity and Authenticity
• Given finite resources and increasing amounts of electronic
records, automation of sampling, integrity and authenticity
verification is very much needed
• What are the criteria for sampling a collection of temporally
changing versions of 'the same' document?
• Authenticity
• Integrity
• Information content
• How to measure a degree of authenticity?
• Computers might assign inaccurate time stamps to records
• How to detect integrity failures?
• e.g., a record describing a female patient with prostate cancer
• How to incorporate constraints into sampling?
• Storage space, compression, computational cost, etc.
56. Example: Temporal Ranking and Integrity
Verification
• Chronological ranking based on time stamps of files
• Last modification (current implementation)
• Ranking can be changed by a human
• Content referring to dates can be used for integrity
verification (a minimal sketch follows)
[Figure: documents ordered along a TIME axis]
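A minimal sketch of the chronological ranking described above: order versions by last-modification time (the current implementation's criterion), then extract dates from content so a human or a rule can flag contradictions. The ISO-style date pattern is an assumption.

    import os
    import re

    DATE = re.compile(r"\b(?:19|20)\d{2}-\d{2}-\d{2}\b")   # assumes ISO-style dates

    def rank_by_mtime(paths):
        # Chronological ranking by last-modification time stamp.
        return sorted(paths, key=os.path.getmtime)

    def embedded_dates(path):
        # Dates referred to in the content, usable to cross-check the ranking.
        with open(path, errors="ignore") as f:
            return DATE.findall(f.read())

    # Usage: for p in rank_by_mtime(files): print(p, embedded_dates(p))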
57. Rules and Attributes for Integrity Verification
• Document integrity attributes?
• appearance or disappearance of document images
• appearance and disappearance of dates embedded in
documents
• file size
• count of image groups
• number of sentences
• average value of dates found in a document
• Rules? (one attribute-plus-rule sketch follows below)
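A minimal sketch computing a few of the attributes listed above (file size, sentence count, dates found), plus one illustrative rule of the kind the slide asks for: a newer version of 'the same' document should not shrink drastically. Both the attribute choices and the 50% tolerance are assumptions.

    import os
    import re

    def integrity_attributes(path):
        # A few of the attributes above: file size, sentence count, dates found.
        text = open(path, errors="ignore").read()
        return {
            "file_size": os.path.getsize(path),
            "sentences": len(re.findall(r"[.!?]+", text)),
            "dates": re.findall(r"\b(?:19|20)\d{2}\b", text),
        }

    def shrank_suspiciously(old, new, tolerance=0.5):
        # Hypothetical rule: a newer version should not lose half its bytes.
        return new["file_size"] < old["file_size"] * tolerance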
58. Summary
• Introduced a set of open problems
related to
• Appraisal of electronic records
• Archival forecasting of preservation
costs
• Automation of processing for
preservation
• The examples from our research used to
illustrate these open problems just
scratch the surface
59. Observations
• Many stakeholders, including government agencies and
companies, are already aware of some of the open
problems
• As all government agencies have been
computerized, the continuity and functioning of the
agencies depend on preservation and reconstruction
of electronic records
• Right now, we are at the beginning of the
exponential growth of electronic records (many more
electronic records will be coming)
• Some scientific fields are already facing real-time
decisions about preserving electronic records (e.g.,
astronomers)
60. Future Vision
• It is envisioned that the preservation and
reconstruction of electronic records have to
follow different paradigms that incorporate
• Scalability (heterogeneity, dimensionality
and volume)
• Forecasting of preservation costs
• A new level of automation and quality
control in processing for preservation
purposes
• The field of electronic record management
and preservation needs forward-looking
solutions to stay abreast of the dynamics
of digital information
61. References to Presented Research
• Bajcsy P., R. Kooper and S-C. Lee, "Understanding Preservation and Reconstruction Requirements for Computer-Assisted
Decision Processes," ACM Journal on Computing and Cultural Heritage (JOCCH), (submitted October 2008).
• Bajcsy P., "A Perspective on Cyberinfrastructure for Water Research Driven by Informatics Methodologies," Geography
Compass, Volume 2, Issue 6 (p. 2040-2061), 2008, Blackwell Publishing Ltd, URL: http://www3.interscience.wiley.com/cgi-
bin/fulltext/121478978/PDFSTART
• Bajcsy P., R. Kooper, L. Marini and J. Myers, "Community-Scale Cyberinfrastructure for Exploratory Science," In:
Cyberinfrastructure Technologies and Applications, Editor: Junwei Cao, Nova Science Publishers, Inc., Chapter 12,
2009; URL: https://www.novapublishers.com/catalog/product_info.php?products_id=8011
• McHenry K. and P. Bajcsy, "An Overview of 3D Data Content, File Formats and Viewers," Technical Report NCSA-
ISDA08-002, October 31, 2008.
• McFadden W., K. McHenry, R. Kooper, M. Ondrejcek, A. Yahja and P. Bajcsy, "Advanced Information Systems for
Archival Appraisals of Contemporary Documents," the 4th IEEE International Conference on e-Science, December 8-12,
2008, Indianapolis, IN.
• Lee S-C., W. McFadden and P. Bajcsy, "Text, Image and Vector Graphics Based Appraisal of Contemporary
Documents," The Seventh International Conference on Machine Learning and Applications, December 11-13, 2008, San
Diego, CA.
• Bajcsy P. and S-C. Lee, "Computer Assisted Appraisal of Contemporary PDF Documents," ARCHIVES 2008: Archival
R/Evolution & Identities, 72nd Annual Meeting Pre-conference Programs, August 24-27, 2008, San Francisco, CA.
• Lee S-C. and P. Bajcsy, "Understanding Challenges in Preserving and Reconstructing Computer-Assisted Medical
Decision Processes," the Workshop on Machine Learning in Biomedicine and Bioinformatics (MLBB07) of the 2007
International Conference on Machine Learning and Applications (ICMLA07), Cincinnati, Ohio, December 13-15, 2007.
• Bajcsy P. and D. Clutter, "Gathering and Analyzing Information about Decision Making Processes Using Geospatial
Electronic Records," the 2006 Winter Federation of Earth Science Information Partners ("Federation") Conference,
poster, January 4-6, 2006, Washington, DC.
62. Questions
• Project URL:
http://isda.ncsa.uiuc.edu/NARA/index.html
and http://isda.ncsa.uiuc.edu/CompTradeoffs/
• Publications – see our URL at
http://isda.ncsa.uiuc.edu/publications
• Peter Bajcsy; email: pbajcsy@ncsa.uiuc.edu