XML Processing in the Cloud: Large-Scale Digital Preservation in Small Institutions

Digital preservation deals with the problem of retaining the meaning of digital information over time to ensure its accessibility. The process often involves a workflow that transforms the digital objects. The workflow defines document pipelines containing transformations and validation checkpoints, either to facilitate migration for persistent archival or to extract metadata. These transformations are computationally expensive, however, and digital preservation can therefore be out of reach for an organization whose core operation is not data conservation. The operations described in the document workflow, on the other hand, do not recur frequently. This paper combines an implementation-independent workflow designer with cloud computing to support small institutions in the ad-hoc peak computing needs that stem from their digital preservation efforts.
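
A document pipeline of the kind described above, alternating validation checkpoints with a transformation, might be sketched as follows. This is only an illustration under stated assumptions: the function names, the `record`/`metadata` layout, and the choice of title and section count as extracted metadata are hypothetical and are not taken from the paper.

```python
import xml.etree.ElementTree as ET


class ValidationError(Exception):
    """Raised when a document fails a validation checkpoint."""


def check_well_formed(xml_text):
    # Checkpoint: reject input that is not well-formed XML.
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        raise ValidationError(f"not well-formed: {exc}") from exc
    return xml_text


def extract_metadata(xml_text):
    # Transformation: wrap the document in a record that carries
    # its title and section count (hypothetical metadata layout).
    root = ET.fromstring(xml_text)
    record = ET.Element("record")
    meta = ET.SubElement(record, "metadata")
    ET.SubElement(meta, "title").text = root.findtext("title", default="")
    ET.SubElement(meta, "sections").text = str(len(root.findall(".//section")))
    record.append(root)
    return ET.tostring(record, encoding="unicode")


def check_has_title(xml_text):
    # Checkpoint: the enriched record must carry a non-empty title.
    root = ET.fromstring(xml_text)
    if not root.findtext("metadata/title"):
        raise ValidationError("missing title in metadata")
    return xml_text


# Assumed step order: checkpoint, transformation, checkpoint.
PIPELINE = [check_well_formed, extract_metadata, check_has_title]


def run_pipeline(xml_text, steps=PIPELINE):
    # Apply each step in order; any checkpoint may abort the run.
    for step in steps:
        xml_text = step(xml_text)
    return xml_text
```

Because each step takes and returns the serialized document, steps can be reordered or replaced without touching the runner, which is one simple way to keep the description of the process separate from its execution.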

  1. Peter Wittek, Swedish School of Library and Information Science, University of Borås, 16/05/11
  2. Outline: (1) Workflows and Digital Preservation; (2) Computational Requirements of Digital Preservation; (3) Preservation Workflow in the Cloud; (4) Experimental Results; (5) Open Issues; (6) Conclusions
  3. Workflows and Digital Preservation: Fundamental Issues in Digital Preservation. Digital objects remain authentic and accessible; Component and management failures; Natural disasters; Attacks; Materials resulting from digital reformatting; Information that is born-digital and has no analog counterpart
  4. Workflows and Digital Preservation: Migration, Enrichment, and Other Approaches. Keeping the content of legacy file formats accessible; Most prominent with proprietary file formats; Infrastructure-independent rendering of content; Migration (legal issues); Dynamic collections: scalability; Reuse; Exploitation with a novel purpose; Sufficient metadata at document and collection level
  5. Workflows and Digital Preservation: An Example of Enrichment: ToC Extraction
  6. Workflows and Digital Preservation: Preserving the Pipeline. Reuse of digital content asks for metadata on both the content and how it was transformed to its most recent form; Document process preservation helps; Architecture-independent description of the intent behind a document process
  7. Workflows and Digital Preservation: An XML Processing Pipeline
  8. Workflows and Digital Preservation: Deployment. Translation of abstract description of workflow; Eclipse Modeling Framework generates Python source code; Grid implementation using iRODS (Integrated Rule-Oriented Data System), a policy-based data grid software system; Current experiment using Amazon Web Services
  9. Computational Requirements of Digital Preservation: Conversion. Steps of a workflow are computationally expensive; XSLT processors; Processing a single large document tree can take hours; Deep parsing and named entity recognition; May involve high-complexity natural language processing; Ad-hoc computations
  10. Computational Requirements of Digital Preservation: Learning. A step towards digital curation; SaaS approach to digital curation; Indexing by Lucene/Nutch; Collection-level metadata extraction by Mahout
  11. Preservation Workflow in the Cloud: MapReduce and Deployment. No internal dependencies for the processes; Designed process is exported via the EMF interface to Python; Simple MapReduce driver to execute the process on individual documents
  12. Preservation Workflow in the Cloud: The Proposed Architecture
  13. Experimental Results: Cost. Figure: Comparison of average cost of computations with different collection sizes (y-axis: Average Cost in USD; x-axis: Number of Processing Cores, 1 to 80; series: collection sizes 100, 1000, and 10000)
  14. Experimental Results: Running time. Figure: Comparison of running times with different collection sizes (y-axis: Running Time (Mins); x-axis: Number of Processing Cores, 1 to 80; series: collection sizes 100, 1000, and 10000)
  15. Open Issues: Obstacles to Adoption. Persistence and high-reliability; MapReduce; Not just a technological issue; Service-level agreement; Particularly problematic; Another EU FP7 project working on it: SLA@SOI; Niche for alternative cloud providers
  16. Conclusions: Acknowledgment. Work has been funded by Sustaining Heritage Access through Multivalent ArchiviNg (SHAMAN), an EU FP7 large integrated project; http://shaman-ip.eu/shaman/
  17. Conclusions: Summary. Digital preservation is an attractive area to be offered as SaaS (Computational needs; Expertise; Complexity); Since persistence requires architecture-independence, cloud adoption is straightforward; High-reliability can be an issue; Service-level agreements need further research
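
Slide 11 mentions a simple MapReduce driver that executes the process on individual documents, relying on the fact that the processes have no internal dependencies. A minimal single-process sketch of that idea is given below. It is not the authors' driver: `process_document` is a hypothetical stand-in for the Python code exported from the workflow designer, and the status labels and function names are assumptions made for illustration. On a real cluster the map calls would be distributed across workers by the MapReduce framework instead of running in a loop.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET


def process_document(xml_text):
    # Hypothetical stand-in for the exported process: here it only
    # checks well-formedness and reports a status label.
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return "invalid"
    return "ok"


def map_document(doc_id, xml_text):
    # Map: one document in, one (status, doc_id) pair out.
    # Documents are independent, so calls can run in any order.
    yield process_document(xml_text), doc_id


def reduce_status(status, doc_ids):
    # Reduce: collect the documents that ended in each status.
    return status, sorted(doc_ids)


def run_job(collection):
    # Local driver: map every document, group the emitted pairs
    # by key (the "shuffle"), then reduce each group.
    grouped = defaultdict(list)
    for doc_id, xml_text in collection.items():
        for key, value in map_document(doc_id, xml_text):
            grouped[key].append(value)
    return dict(reduce_status(key, values) for key, values in grouped.items())
```

Calling `run_job` with a dictionary that maps document identifiers to XML strings returns a dictionary from each status to the sorted identifiers that produced it, which gives a quick report of which documents failed a checkpoint.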
