Scalable Services for Digital Preservation – Ross King
Published on: 3rd Annual WePreserve Conference, Nice, 2008
Published in: Technology

Transcript

  • 1. The Planets Interoperability Framework – Scalable Services for Digital Preservation
      DPE, Planets and CASPAR Third Annual Conference: Costs, Benefits and Motivations for Digital Preservation, 30 October 2008
      Ross King, Christian Sadilek, Rainer Schmidt – Austrian Research Centers GmbH (ARC)
  • 2. Outline
      – Planets Interoperability Framework
      – Grids and Clouds
      – Initial Experimental Results
  • 3. The Planets Interoperability Framework – Motivation
      – There are a number of functions that all (or nearly all) software applications commonly need, such as:
          • Data persistence
          • User management
          • Authentication and authorization
          • Monitoring, logging, and notification
      – The Interoperability Framework (IF) software components provide these commonly required functions.
  • 4. The Planets Interoperability Framework
      – Defines a Service-Oriented Architecture for digital preservation: a set of services, interfaces, and a common data model
      – Implements common services: authentication and authorization, monitoring, logging, notification, …
      – Service registration and lookup (a conceptual sketch follows below)
      – Provides APIs for applications that use Planets: Testbed experiments, executing preservation plans
      – Provides a workflow enactment service and engine: component-based, with XML serialization
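      As a conceptual illustration of the service registration and lookup idea above (not the actual Planets IF API), the following Python sketch models a registry that maps a service type and a format pair to an endpoint. All names in it, including ServiceRegistry and the toy migration endpoint, are hypothetical.

        # Conceptual sketch of service registration and lookup as described on this slide.
        # This is NOT the actual Planets IF API; the registry, service names and the toy
        # migration endpoint below are hypothetical stand-ins used only for illustration.
        from dataclasses import dataclass, field
        from typing import Callable, Dict, Tuple

        @dataclass
        class ServiceRegistry:
            """Maps (service type, input format, output format) to a callable endpoint."""
            _services: Dict[Tuple[str, str, str], Callable[[bytes], bytes]] = field(default_factory=dict)

            def register(self, service_type: str, in_fmt: str, out_fmt: str,
                         endpoint: Callable[[bytes], bytes]) -> None:
                self._services[(service_type, in_fmt, out_fmt)] = endpoint

            def lookup(self, service_type: str, in_fmt: str, out_fmt: str) -> Callable[[bytes], bytes]:
                return self._services[(service_type, in_fmt, out_fmt)]

        if __name__ == "__main__":
            registry = ServiceRegistry()
            # A toy "migration" endpoint standing in for a real preservation tool service.
            registry.register("migrate", "text/plain", "text/upper", lambda data: data.upper())
            migrate = registry.lookup("migrate", "text/plain", "text/upper")
            print(migrate(b"hello planets"))  # b'HELLO PLANETS'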
  • 5. The Problem of Scalability
      – Planets is a preservation architecture based on Web Services
          • Supports interoperability and a distributed environment
          • Sufficient for controlled experiments (Testbed)
      – Not sufficient for handling a production environment
          • Massive, uncontrolled user requests
          • Mass migration of hundreds of terabytes of data
      – Content holders are faced with losing vast amounts of data
          • Are sufficient computational resources available in-house?
      – There is a clear demand for incorporating Grid or Cloud technology
  • 6. Integrating Virtual Clusters and Clouds
      – Basic idea: extending the Planets SOA with Grid services
      – The Planets IF Job Submission Service (JSS) allows job submission to a PC cluster (e.g. Hadoop, Condor) using a Grid approach and standards (SOAP, HPC-BP, JSDL); a hypothetical submission sketch follows below
      – Cluster nodes are instantiated from specific system images
          • Most preservation tools are third-party applications
          • The software needs to be preinstalled on the cluster nodes
      – The cluster and JSS can be instantiated in-house (e.g. in a PC lab) or on top of leased cloud resources (AWS EC2)
      – Computation can be moved to the data, or vice versa
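      As a rough illustration of how a client could hand a JSDL job description to a Job Submission Service over SOAP, here is a minimal Python sketch. The endpoint URL, the submitJob SOAPAction and the bare SOAP envelope are assumptions made for illustration, not the documented Planets JSS interface.

        # Minimal sketch of submitting a JSDL job description to a Job Submission Service
        # over HTTP/SOAP. The endpoint URL, SOAPAction and response handling are
        # hypothetical placeholders, not the actual Planets JSS interface.
        import urllib.request

        JSS_ENDPOINT = "http://jss.example.org/planets-if/JobSubmissionService"  # hypothetical

        def submit_job(jsdl_path: str) -> str:
            """POST a JSDL document to the (hypothetical) JSS endpoint and return the raw response."""
            with open(jsdl_path, "rb") as f:
                jsdl = f.read()

            # Wrap the JSDL document in a minimal SOAP 1.1 envelope.
            envelope = (
                b'<?xml version="1.0" encoding="UTF-8"?>'
                b'<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">'
                b"<soap:Body>" + jsdl + b"</soap:Body></soap:Envelope>"
            )

            request = urllib.request.Request(
                JSS_ENDPOINT,
                data=envelope,
                headers={"Content-Type": "text/xml; charset=utf-8",
                         "SOAPAction": "submitJob"},  # hypothetical operation name
            )
            with urllib.request.urlopen(request) as response:
                return response.read().decode("utf-8")

        if __name__ == "__main__":
            print(submit_job("migration-job.jsdl"))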
  • 7. Integration – Planets Tiered Architecture
      [Architecture diagram. Components shown include: preservation planning and workflow generation services; the tool/service and format registries; the experimental environment (workbench for Testbed experiments) and the production environment (web portal, web/Grid clients); a workflow engine with workflow definitions; 3rd-party tool execution services and Grid/Cloud execution services; and the Planets service context with its data model, metadata browser, and data store.]
  • 8. Experimental Setup
      – Amazon Elastic Compute Cloud (EC2)
          • 1–150 cluster nodes
          • Custom image based on Red Hat Fedora 8 (i386)
      – Amazon Simple Storage Service (S3)
          • max. 1 TB I/O
          • ~32.5 MBit/s download / ~13.8 MBit/s upload (cloud-internal)
      – Apache Hadoop (v0.18)
          • MapReduce implementation
          • Pre-installed command-line tools (e.g. ps2pdf); a sketch of the per-node migration step follows below
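      The migration step executed on each node boils down to invoking a pre-installed command-line tool. The Python sketch below shows such a per-node step with ps2pdf; the input/ and output/ directories and the error handling are illustrative assumptions, not taken from the Planets code base.

        # Sketch of the per-node migration step assumed in the experiments: converting
        # PostScript files to PDF with the pre-installed ps2pdf command-line tool.
        # Paths and error handling are illustrative, not taken from the Planets code base.
        import subprocess
        from pathlib import Path

        def migrate_ps_to_pdf(src: Path, out_dir: Path) -> Path:
            """Convert a single PostScript file to PDF using ps2pdf and return the output path."""
            out_dir.mkdir(parents=True, exist_ok=True)
            target = out_dir / (src.stem + ".pdf")
            # ps2pdf <input.ps> <output.pdf>; fails loudly if the tool is missing or the input is invalid
            subprocess.run(["ps2pdf", str(src), str(target)], check=True)
            return target

        if __name__ == "__main__":
            for ps_file in Path("input").glob("*.ps"):
                print("migrated:", migrate_ps_to_pdf(ps_file, Path("output")))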
  • 9. [Architecture diagram repeated from slide 7: the Planets tiered architecture, with the Grid/Cloud execution services shown within the Planets service context.]
  • 10. Experimental Setup
      [Deployment diagram: a JSDL job description file is submitted to the Job Submission Service (JSS), which runs the job on a virtual cluster (Apache Hadoop) of Xen-based virtual nodes in the cloud infrastructure (EC2); raw data is staged from the storage infrastructure (S3) via a data transfer service.]
  • 11. Experimental Results 1 – Scaling Job Size
      [Chart: processing time (min) vs. number of tasks (1–1000) on 5 nodes, comparing EC2 runs against SLE runs for payload sizes of 0.07 MB, 7.5 MB and 250 MB. Speedup x(1k) = t_seq / t_parallel at 1000 tasks: 3.5, 4.4 and 3.6.]
  • 12. Experimental Results 2 – Scaling #Nodes
      [Chart: processing time (min) vs. number of EC2 nodes for 1000 tasks of 0.07 MB each, with the SLE baseline at n = 1 (local): t = 26, s = 1. Measured points: n = 1: t = 36, s = 0.72, e = 72%; n = 5: t = 8, s = 3.25, e = 65%; n = 10: t = 4.5, s = 5.8, e = 58%; n = 50: t = 1.68, s = 15.5, e = 31%; n = 100: t = 1.03, s = 25.2, e = 25%. A worked check of the speedup figures follows below.]
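      The speedup and efficiency figures above are consistent with the standard definitions s = t_seq / t_parallel and e = s / n. A short worked check for the n = 10 data point, using the 26-minute sequential baseline, in LaTeX:

        % Speedup s and parallel efficiency e as used in the result slides
        \[
          s = \frac{t_{\mathrm{seq}}}{t_{\mathrm{parallel}}} = \frac{26\ \text{min}}{4.5\ \text{min}} \approx 5.8,
          \qquad
          e = \frac{s}{n} = \frac{5.8}{10} \approx 58\%.
        \]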
  • 13. Conclusions
      – Preservation systems will need to employ Grid/Cloud resources; there is therefore a need to bridge the digital libraries and e-science communities.
      – Cloud and virtual infrastructures provide a powerful solution for obtaining on-demand access to computational resources.
      – The Planets IF Job Submission Service provides a first step: submission to a virtual cluster of preservation nodes using Grid protocols.
      – Performance scales roughly with the number of nodes, accounting for expected overheads.
      – Many open issues remain: security, reliability, standardization, legal aspects, …
  • 14. Thank you for your attention!
      – Planets Project: http://www.planets-project.eu
      – Contacts: Ross King <ross.king@arcs.ac.at>, Rainer Schmidt <rainer.schmidt@arcs.ac.at>
  • 15. Sample JSDL code

        <?xml version="1.0" encoding="UTF-8"?>
        <jsdl:JobDefinition xmlns="http://www.example.org/"
            xmlns:jsdl="http://schemas.ggf.org/jsdl/2005/11/jsdl"
            xmlns:jsdl-posix="http://schemas.ggf.org/jsdl/2005/11/jsdl-posix"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="http://schemas.ggf.org/jsdl/2005/11/jsdl jsdl.xsd">
          <jsdl:JobDescription>
            <jsdl:JobIdentification>
              <jsdl:JobName>start vi</jsdl:JobName>
            </jsdl:JobIdentification>
            <jsdl:Application>
              <jsdl:ApplicationName>ls</jsdl:ApplicationName>
              <jsdl-posix:POSIXApplication>
                <jsdl-posix:Executable>/bin/ls</jsdl-posix:Executable>
                <jsdl-posix:Argument>-la file.txt</jsdl-posix:Argument>
                <jsdl-posix:Environment name="LD_LIBRARY_PATH">/usr/local/lib</jsdl-posix:Environment>
                <jsdl-posix:Input>/dev/null</jsdl-posix:Input>
                <jsdl-posix:Output>stdout.${JOB_ID}</jsdl-posix:Output>
                <jsdl-posix:Error>stderr.${JOB_ID}</jsdl-posix:Error>
              </jsdl-posix:POSIXApplication>
            </jsdl:Application>
          </jsdl:JobDescription>
        </jsdl:JobDefinition>
  • 16. Map-Reduce for Migrating Digital Objects
      – Map-Reduce implements a framework and programming model for processing large documents (sorting, searching, indexing) on multiple nodes:
          • Automated decomposition (split)
          • Mapping to intermediate pairs (map), optionally combined locally (combine)
          • Merging of output (reduce)
      – Provides an implementation model for data-parallel, I/O-intensive problems
      – Example: conversion of a composite digital object (e.g. website, folder, archive); a conceptual sketch follows below
          • Decompose into atomic pieces (e.g. file, image, movie)
          • On each node, convert each piece to the target format
          • Merge the pieces and create a new data object
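      The conversion example above can be sketched end to end in a few lines of Python. The split, map and reduce functions below are conceptual stand-ins (the conversion step is a pass-through placeholder), not the actual Hadoop job used in the experiments.

        # Conceptual, self-contained sketch of the Map-Reduce pattern described above,
        # applied to migrating a composite digital object (here: a folder of files).
        # The "convert" step is a placeholder; in the experiments it would shell out to a
        # migration tool such as ps2pdf on each Hadoop node.
        import zipfile
        from pathlib import Path
        from typing import Iterable, Tuple

        def split(folder: Path) -> Iterable[Path]:
            """Decompose the composite object into atomic pieces (individual files)."""
            return sorted(p for p in folder.rglob("*") if p.is_file())

        def map_convert(piece: Path) -> Tuple[str, bytes]:
            """Convert one piece to the target format (placeholder: pass bytes through unchanged)."""
            return piece.name, piece.read_bytes()

        def reduce_merge(converted: Iterable[Tuple[str, bytes]], target: Path) -> Path:
            """Merge the converted pieces into a new composite object (a zip archive)."""
            with zipfile.ZipFile(target, "w") as archive:
                for name, data in converted:
                    archive.writestr(name, data)
            return target

        if __name__ == "__main__":
            pieces = split(Path("website"))                  # split
            converted = (map_convert(p) for p in pieces)     # map (parallel on a real cluster)
            print(reduce_merge(converted, Path("website-migrated.zip")))  # reduce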
