SCAPE - Scalable Preservation Environments

Ross King, Project Director of SCAPE, gave a short presentation of the EU funded project SCAPE, including descriptions of tools for planning and monitoring digital preservation, scalable computation and repositories, SCAPE Testbeds and where to learn more.
The presentation was given at the workshop ‘Preservation at Scale’ (http://bit.ly/17ppAln), held in connection with the iPres2013 conference in Lisbon, Portugal, in September 2013.

SCAPE - Scalable Preservation Environments: Presentation Transcript

  • SCAPE Tools and Infrastructure for Preservation at Scale
    Dr. Ross King, AIT Austrian Institute of Technology GmbH
    Preservation at Scale Workshop, Lisbon, September 5, 2013
  • Outline
      • SCAPE Project
      • SCAPE Solutions
          • Scalable Planning
          • Scalable Tools
          • Scalable Computation
          • Scalable Repositories
      • SCAPE Testbeds
      • SCAPE Additional Information
          • Online Resources
          • Training Events
          • Contact Information
    This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137).
  • SCAPE – what is it about?
      • Planning and executing computing-intensive digital preservation processes, such as the large-scale ingestion, characterisation or migration of large (multi-terabyte) and complex data sets
      • SCAPE results include
          • Preservation scenarios
          • Preservation tools
          • Preservation workflows
          • Preservation infrastructure
          • Preservation best practices
    SCAPE is a follow-up to the highly successful FP6 IP Planets.
  • SCAPE Project Data
      • Project instrument: FP7 Collaborative Project
      • 6th Call
          • Objective ICT-2009.4.1: Digital Libraries and Digital Preservation
          • Target outcome (a): Scalable systems and services for preserving digital content
      • 10th Call
          • Objective ICT-2013.11.4: Supplements to Strengthen Cooperation in ICT R&D in an Enlarged European Union
      • Duration: 44 months (revised from 42)
          • February 2011 – September 2014 (revised from July 2014)
      • Budget: 12.0 million euro (revised from 11.3)
      • Funded: 9.2 million euro (revised from 8.6)
  • SCAPE Consortium
  • SCAPE Solutions
  • Scalable Planning and Watch
      • SCOUT: an automated preservation watch system
          • Enables planning tools and decision makers to monitor the world and the organisation
          • Collects relevant knowledge and enables automated notification
          • Open and extensible
      • c3po: scalable content profiling
          • c3po analyses characterisation data based on FITS
          • Scale-out MongoDB (100k/min/node)
          • Visual drill-down and well-documented profile
          • Automated sample selection
      • PLATO 4.1: scalable preservation planning
          • www.ifs.tuwien.ac.at/dp/plato
          • Technology upgrade: refactored, rebuilt, standardised, tested
          • New features
              • Groups allow collaborative planning
              • Integration of control policies for groups
              • Quality domain – measures
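The c3po throughput figure above can be turned into a rough capacity estimate. The sketch below is only illustrative: it assumes "100k/min/node" means 100,000 characterisation records per minute per node and that throughput scales linearly with the number of nodes, neither of which the slide states. The function name and the example numbers are hypothetical.

```python
import math

# Assumption: "100k/min/node" = 100,000 characterisation records
# processed per minute on each node.
RECORDS_PER_MINUTE_PER_NODE = 100_000


def estimated_minutes(total_records: int, nodes: int) -> int:
    """Estimate profiling time in whole minutes, assuming ideal linear scale-out."""
    if total_records < 0:
        raise ValueError("total_records must not be negative")
    if nodes < 1:
        raise ValueError("at least one node is required")
    # Round up: a partially used minute still counts as a minute.
    return math.ceil(total_records / (RECORDS_PER_MINUTE_PER_NODE * nodes))


if __name__ == "__main__":
    # Hypothetical collection of 10 million records on a 4-node cluster.
    print(estimated_minutes(10_000_000, 4), "minutes")
```

Real clusters rarely scale perfectly linearly, so the result is best read as a lower bound.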
  • Scalable Tools
      • Tool Wrapper
          • Application that adapts existing tools to the SCAPE Platform
          • https://github.com/openplanets/scape-toolwrapper
          • Enhances wrapped tools
              • Standard naming scheme for CC, AS and QA tools
              • Standard invocation method (CLI)
              • Debian packages for easy deployment on the cluster
              • Support for data streaming (useful for Hadoop jobs)
          • Generates Preservation Components
              • Taverna workflows with embedded metadata for easy discovery
              • Automatic publication of components on myExperiment (to support discoverability)
              • Standard ports to enable composition of Preservation Components (based on well-defined component profiles: CC, AS and QA)
      • Digital Preservation Toolkit
          • Software suite that contains a large set of DP tools
          • 77 operations in total
          • Easy to deploy on Linux machines (via apt-get)
          • apt-get install digital-preservation-tools
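The idea of a standard CLI invocation with data streaming can be sketched in a few lines. This is not the SCAPE Tool Wrapper or its generated code: `WrappedTool`, `run_stream` and the `UPPERCASE` stand-in tool are hypothetical names used only to show a uniform "bytes in on stdin, bytes out on stdout" contract, which is the shape that suits streaming-style Hadoop jobs.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class WrappedTool:
    """Hypothetical wrapper: every tool is invoked the same way via its CLI."""
    name: str
    command: tuple[str, ...]

    def run_stream(self, data: bytes) -> bytes:
        # Feed the input on stdin and collect stdout, so no temporary
        # files are needed between processing steps.
        result = subprocess.run(
            list(self.command),
            input=data,
            stdout=subprocess.PIPE,
            check=True,  # raise CalledProcessError on a non-zero exit code
        )
        return result.stdout


# Stand-in "tool" (an assumption for the example): upper-cases its input.
UPPERCASE = WrappedTool(
    name="example-uppercase",
    command=(
        sys.executable,
        "-c",
        "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read().upper())",
    ),
)

if __name__ == "__main__":
    print(UPPERCASE.run_stream(b"tiff to jp2"))
```

Because every wrapped tool exposes the same call, a workflow step does not need to know which underlying program it is driving.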
  • Scalable Computation
      • Deployment of environments
          • XEN Hypervisor
          • Eucalyptus
      • Deployment of tools
          • Debian Packages
          • Tool Spec
      • Job Execution Service (JES)
          • Apache Oozie
          • Apache Hadoop
    Figure: User view on the SCAPE development cloud at AIT: Eucalyptus web interface, Hybridfox browser add‐on, and terminal‐based interaction. (Image from digitalbevaring.dk)
  • Scalable Repositories
      • Fedora 4.0.0
          • All REST, no SOAP
          • RDF as first class objects
          • JCR 2.0 Implementation (ModeShape)
          • Infinispan distributed NoSQL datastore
      • Lily 2.0
          • Built on top of HBase/HDFS
          • Integration of computation and storage
  • SCAPE Architecture
    [Diagram: SCAPE architecture. Labelled elements include the Digital Object Repository, Execution Platform, JES, Hadoop, Automated Watch, Automated Planning, PLATO, Plan Management GUI, Component Catalogue, Taverna Workbench, Component Profile Validator, Knowledge Source Adaptor, Publication Platform and Data Loader Application, together with the following APIs: Plan Management, JES, Data Connector, Component Lookup, Component Registration, Push, Pull, Watch Request, Notification, Report and LDS3.]
  • SCAPE Testbeds
  • SCAPE Testbeds
      • Large-scale Digital Repositories
          • Carry out large-scale image migrations
            The master files from legacy digitized image collections are typically TIFF files, which can be costly to store due to their size. The cost benefit can only be realized if the original TIFFs can be removed, and this can only be done if there is evidence of successful migration. (2.2 million pages, 80 TB)
          • Detect poor sound quality
            In a collection of mp3 files (20 TB, 360,000 files) we have discovered files with very bad sound quality. Before ingesting everything into our DOMS, we would like to be able to discover the bad files and potentially get them re-digitized from the original analogue media.
      • Research Data Sets
          • RAW to NeXus conversion
            File size and content volume challenges have been identified for NeXus files. The RAW to NeXus format migration tool can be customised to account for various other types of experiment data files during the migration. However, the scalability challenge is that these other types of experiment data files vary significantly between instruments, which are specific to each facility.
    See http://wiki.opf-labs.org/display/SP/Scenarios
    (Image from digitalbevaring.dk)
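The "evidence of successful migration" requirement in the image migration scenario can be illustrated with a simple round-trip check. This is a minimal sketch under strong assumptions, not the SCAPE quality assurance workflow: `zlib` compression stands in for a lossless format migration, and success is defined as the decoded result matching the original byte for byte. A real image migration would need a comparison of decoded image content, and a lossy target format would need a similarity measure instead of an exact match. All function names are hypothetical.

```python
import hashlib
import zlib


def content_digest(decoded: bytes) -> str:
    """Fingerprint of the decoded content, used as migration evidence."""
    return hashlib.sha256(decoded).hexdigest()


def migrate(original: bytes) -> bytes:
    # Stand-in for a lossless format migration (assumption for this sketch).
    return zlib.compress(original)


def decode_migrated(migrated: bytes) -> bytes:
    # Stand-in for decoding the migrated file back to raw content.
    return zlib.decompress(migrated)


def migration_verified(original: bytes, migrated: bytes) -> bool:
    """True only if the migrated file decodes to exactly the original content."""
    try:
        decoded = decode_migrated(migrated)
    except zlib.error:
        return False  # an unreadable result is never evidence of success
    return content_digest(decoded) == content_digest(original)


if __name__ == "__main__":
    page = b"scanned page data " * 1000
    migrated = migrate(page)
    # Only a verified migration would justify removing the original master file.
    print("safe to remove original:", migration_verified(page, migrated))
```

Keeping the digests alongside the migrated files gives an audit trail that can be checked again later.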
  • SCAPE Testbeds
      • Web Content
          • Quality assurance in web harvesting
            Web crawling is a process that is highly susceptible to errors. Often, essential data is missed by the crawler and thus not captured and preserved. Currently, quality assurance requires manual effort, and because crawls often contain millions of pages, manual quality assurance will not be very efficient.
      • Data Centers
          • Anonymization of medical data
            In order to fulfil the safety and security requirements for storing medical data, it will be necessary to develop encryption and anonymization services that allow medical data to be transferred to a data center’s remote storage facilities. On the one hand, encryption techniques will be used to secure sensitive personal data (e.g. internal documents, patient databases), which must only be accessible to authorized services and users. On the other hand, anonymization services will enable medical data (such as x-ray generator outputs, x-ray computed tomography outputs and surgery recordings) to be stored in the data center without sensitive data attached.
    (Image from digitalbevaring.dk)
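The anonymization scenario above can be sketched as stripping identifying fields from a record before it leaves the organisation. This is an illustrative assumption, not the SCAPE service: the field names, the `anonymize` function and the record layout are hypothetical. Strictly speaking, the keyed hash used here is pseudonymization, since whoever holds the secret key can re-link records. Encryption of the remaining sensitive data is out of scope for this example.

```python
import hashlib
import hmac

# Hypothetical set of directly identifying fields to drop.
SENSITIVE_FIELDS = frozenset({"patient_name", "date_of_birth", "address"})


def pseudonym(patient_id: str, secret_key: bytes) -> str:
    """Keyed hash: stable for the same patient, not reversible without the key."""
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize(record: dict, secret_key: bytes) -> dict:
    """Return a copy of the record without identifying fields."""
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in SENSITIVE_FIELDS and key != "patient_id"
    }
    # Replace the real identifier so scans of one patient stay linkable.
    cleaned["pseudonym"] = pseudonym(record["patient_id"], secret_key)
    return cleaned


if __name__ == "__main__":
    scan = {
        "patient_id": "P-0001",
        "patient_name": "Jane Doe",
        "date_of_birth": "1970-01-01",
        "modality": "CT",
        "image_file": "scan-0001.dcm",
    }
    print(anonymize(scan, secret_key=b"keep-this-key-inside-the-hospital"))
```

The secret key would stay with the data owner, so the data center only ever sees the pseudonym.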
  • SCAPE Additional Information
  • Additional Resources of Interest
      • Development Infrastructure
          • Code repository hosted by the Open Planets Foundation and GitHub: https://github.com/openplanets/scape/
          • Development Wiki: http://wiki.opf-labs.org/display/SP/Home
      • Experimental Workflows: http://www.myexperiment.org/search?query=SCAPE&type=all&commit=Search
      • Publications: http://www.scape-project.eu/category/publication
      • Public Deliverables: http://www.scape-project.eu/category/deliverable
      • Tools: http://www.scape-project.eu/tools
  • SCAPE Training Events
      • Future Formats First: Application Infrastructures for Action Services
          • 16-17 September 2013, London
          • Registration: http://scape-future-formats-first.eventbrite.co.uk/
      • Critical Path: Effective Evidence Based Preservation Planning
          • 13 November 2013, Aarhus
      • Hadoop-driven Digital Preservation (Hackathon)
          • 2-4 December 2013, Vienna
    See http://www.scape-project.eu/events
  • SCAPE Contact Information
      • http://www.scape-project.eu/
      • Twitter: #scapeproject
      • office@list.scape-project.eu
      • Dr. Ross King, AIT Austrian Institute of Technology GmbH, Donau-City-Strasse 1, A-1220 Wien
  • Thank you for your attention! Questions?