Dr. Ross King, AIT Austrian Institute of Technology GmbH, gave an invited talk about the FP7 project SCAPE at the eSciDoc Days in Berlin, October 27, 2011, https://www.escidoc.org/JSPWiki/en/ESciDocDays.
SCAPE - Building Digital Preservation Infrastructure
SCAPE: Building Digital Preservation Infrastructure
Dr. Ross King, AIT Austrian Institute of Technology GmbH
eSciDoc Days, Berlin, October 27, 2011
Digital Preservation
• For the first time, the rate of increase of information creation is beginning to exceed the rate of increase in storage capacity.
• This massive volume of digital material raises a number of issues:
  • What is worth preserving?
  • How to preserve so much?
  • How to access preserved data?
  • How to create incentives to preserve?
Source: http://arstechnica.com/business/consumerization-of-it/2011/09/information-explosion-how-rapidly-expanding-storage-spurs-innovation.ars
Digital Preservation
• Standards, best practices, and technologies used to ensure access to digital information over time
• How long?
  “Digital documents last forever – or five years, whichever comes first.”
  http://www.clir.org/pubs/reports/rothenberg/introduction.html
• Generally we mean decades or centuries
SCAPE: what is it about?
• Planning and managing computing-intensive (digital) preservation processes such as the large-scale ingestion or migration of large (multi-terabyte) data sets
SCAPE is a follow-up to the highly successful FP6 IP Planets.
SCAPE Project Data
• Project instrument: FP7 Integrated Project
• Call 6
  • Objective ICT-2009.4.1: Digital Libraries and Digital Preservation
  • Target outcome (a): Scalable systems and services for preserving digital content
• Duration: 42 months
  • February 2011 to July 2014
• Budget: 11.3 million euro
  • Funded: 8.6 million euro
SCAPE Consortium
Number           Partner name                                Short name  Country
1 (coordinator)  AIT Austrian Institute of Technology GmbH   AIT         AT
2                British Library                             BL          UK
3                Internet Memory Foundation                  IMF         NL
4                Ex Libris Ltd                               EXL         IL
5                Fachinformationszentrum Karlsruhe           FIZ         DE
6                Koninklijke Bibliotheek                     KB          NL
7                KEEP Solutions                              KEEPS       PT
8                Microsoft Research                          MSR         UK
9                Österreichische Nationalbibliothek          ONB         AT
10               Open Planets Foundation                     OPF         UK
11               Statsbiblioteket Aarhus                     SB          DK
12               Science and Technology Facilities Council   STFC        UK
13               Technische Universität Berlin               TUB         DE
14               Technische Universität Wien                 TUW         AT
15               University of Manchester                    UNIMAN      UK
16               Pierre & Marie Curie Université Paris 6     UPMC        FR
SCAPE Project Overview
SCAPE will enhance the state of the art in digital preservation in three ways:
• Infrastructure and tools for scalable preservation actions
• A framework for automated, quality-assured preservation workflows
• Integration of these components with policy-based automated preservation planning and watch
SCAPE results will be validated in three large-scale testbeds:
• Digital Repositories
• Web Content
• Research Data Sets
The SCAPE Consortium brings together a broad spectrum of expertise from
• Memory institutions
• Data centres
• Research labs
• Universities
• Industrial firms
[Figure: SCAPE project structure diagram. Labels: Takeup, Stakeholders, Communities, Dissemination, Training Activities, Sustainability, Testbeds, Corpora, Integration, Benchmarking, Validation, Cross-project Activities, Project Management, Technical Coordination, Research Roadmap, Platform, Automation, Workflows, Parallelization, Virtualization, Planning and Watch, Institutional Policies, Technical Watch, Automated Planning, Preservation Components, Quality Assurance, Scalable Components, Automation-ready Tools]
Selected SCAPE Testbed Scenarios
• Characterise large video files
  • The master MPEG2 files are so large that it is difficult to apply JHOVE, and the output it provides is insufficiently detailed. A detailed characterisation of the MPEG2 streams is needed to identify technical dependencies for extracting from or rendering the MPEG2 stream. This would allow preservation risks related to current access services to be monitored, and action to be taken as necessary to ensure continued access and preservation.
• Carry out large-scale migrations
  • Migrating from one format to another introduces the possibility of damaging the content or failing to capture significant properties of the original in the destination format.
  • Specific requirements include:
    • Solution tools that operate reliably at scale (80 TB, 2 million pages)
    • Automated QA, ideally with no manual intervention on a file-by-file basis
    • QA performed by a process independent of the migration process
    • QA demonstrates strong evidence of significant properties being captured in the destination format
• Quality assurance in web harvesting
  • For large-scale crawls, automation of the quality control processes is a necessary requirement. Currently, this process relies on random sampling and very basic quantitative checks.
[Illustration from digitalbevaring.dk]
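The migration QA requirements above (automated, independent of the migration tool, and giving evidence that significant properties survive) can be illustrated with a minimal Python sketch. It assumes that a separate characterisation step has already extracted properties from each source and destination file into plain dictionaries. The property names (`width`, `height`), the function names, and the `QAResult` type are hypothetical and are not part of any SCAPE tool.

```python
from dataclasses import dataclass, field


@dataclass
class QAResult:
    file_id: str
    passed: bool
    # Maps property name -> (value before migration, value after migration)
    mismatches: dict = field(default_factory=dict)


def compare_properties(file_id, source_props, target_props, required):
    """Check that every required property survived the migration unchanged."""
    mismatches = {}
    for name in required:
        before = source_props.get(name)
        after = target_props.get(name)
        # A property missing from the source also counts as a failure,
        # because there is nothing to verify the destination against.
        if before is None or before != after:
            mismatches[name] = (before, after)
    return QAResult(file_id, not mismatches, mismatches)


def qa_batch(pairs, required):
    """Run the check over many files and return only the failures."""
    results = (compare_properties(fid, src, dst, required)
               for fid, src, dst in pairs)
    return [r for r in results if not r.passed]


# Hypothetical example data: two migrated pages, one with a changed height.
pairs = [
    ("page-0001", {"width": 2480, "height": 3508}, {"width": 2480, "height": 3508}),
    ("page-0002", {"width": 2480, "height": 3508}, {"width": 2480, "height": 3500}),
]
failures = qa_batch(pairs, ["width", "height"])
print(failures[0].file_id, failures[0].mismatches)
# prints: page-0002 {'height': (3508, 3500)}
```

Only failing files are returned, so no manual inspection is needed for files that pass. Which properties count as significant is a policy decision that this sketch leaves to the caller.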
Selected SCAPE Challenges
• Bridging the gap between test workflows and scalable workflows
• Applying Map/Reduce to binary data
• Locality of data
  • Bring the data to the computation, or bring the computation to the data?
• Repository Integration
  • Repository Consistency
  • Scalable Ingest
• Preservation Planning
  • How to scale?
  • How to automate?
• Research data sets
  • How to preserve contextual information?
[Illustration from digitalbevaring.dk]
SCAPE Solutions
• SCAPE Platform
  • HADOOP, Stratosphere
  • Virtualized cluster
  • Repository integration
    • HBASE, HDFS - Fedora
  • Three levels of parallelization
    • Distribution of files
    • Splitting binary files
    • Parallelisation of algorithms
  • Mapping Taverna to HADOOP
[Illustration from digitalbevaring.dk]
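The idea of splitting binary files and processing the pieces in a map/reduce style can be sketched in plain Python. This is not Hadoop code and not the SCAPE platform; it is a stand-in that uses a thread pool for the map step, and all function names are hypothetical. It computes a byte-value histogram, a task chosen because the result does not depend on where the file is cut. Real formats such as MPEG2 would need format-aware split points.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_binary(data: bytes, chunk_size: int):
    """Split a byte string into fixed-size chunks (the last may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def map_chunk(chunk: bytes) -> Counter:
    """Map step: count how often each byte value occurs in one chunk."""
    return Counter(chunk)


def reduce_counts(a: Counter, b: Counter) -> Counter:
    """Reduce step: merge two partial histograms."""
    return a + b


def byte_histogram(data: bytes, chunk_size: int = 4, workers: int = 2) -> Counter:
    """Split the data, map over the chunks in parallel, then reduce."""
    chunks = split_binary(data, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(reduce_counts, partials, Counter())


# The chunked result equals the result computed over the whole input.
sample = b"aabbbc" * 10
assert byte_histogram(sample) == Counter(sample)
```

The same pattern covers the first level of parallelization, distribution of files, if each chunk is replaced by a whole file. The third level, parallelising the algorithm itself, depends on the algorithm and is not shown.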
SCAPE Solutions
• Automated Planning and Watch
  • Building on the Planets PLATO tool
  • Automated watch based on
    • Results Evaluation Framework (REF) database
    • Monitoring trends in web harvests
  • Automated planning based on semantically formalized policies
• Automated Quality Assurance
  • QA in web harvesting through automated comparison of rendered pages, combining structural and image analysis
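The slide does not say how rendered pages are compared, so the following is only a minimal illustration of the image-analysis half of the idea, not the SCAPE method. It assumes two renderings of the same page are already available as equally sized grayscale grids (lists of rows with values 0 to 255) and flags a page when the mean absolute pixel difference exceeds a threshold. The function names and the threshold value are hypothetical.

```python
def mean_abs_difference(img_a, img_b):
    """Average absolute difference per pixel between two grayscale grids."""
    same_shape = len(img_a) == len(img_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(img_a, img_b)
    )
    if not same_shape:
        raise ValueError("images must have the same dimensions")
    total = 0
    count = 0
    for row_a, row_b in zip(img_a, img_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += abs(pixel_a - pixel_b)
            count += 1
    return total / count if count else 0.0


def pages_match(img_a, img_b, threshold=10.0):
    """Treat two renderings as equivalent if they differ only slightly."""
    return mean_abs_difference(img_a, img_b) <= threshold


# Hypothetical 2x2 renderings: 'harvested' differs slightly, 'broken' is inverted.
reference = [[0, 0], [255, 255]]
harvested = [[0, 10], [255, 245]]
broken = [[255, 255], [0, 0]]

print(mean_abs_difference(reference, harvested))  # prints: 5.0
print(pages_match(reference, harvested))          # prints: True
print(pages_match(reference, broken))             # prints: False
```

A pure pixel difference is sensitive to small layout shifts, which is one reason to combine it with structural analysis as the slide describes.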
SCAPE Achievements
• Public Website
  • http://www.scape-project.eu/
• Development Infrastructure
  • Hosted by the Open Planets Foundation and GitHub
  • Development Wiki: http://wiki.opf-labs.org/display/SP/Home
• Deliverables
  • First Deliverables available for download
• Publications
  • 13 in the first nine months, including 6 at iPres next week
  • Report: comparative analysis of identification tools
• Platform
  • 10-node, 20 TB experimental cluster hosted by AIT
SCAPE Contact Information
• http://www.scape-project.eu/
• firstname.lastname@example.org
• Dr. Ross King
  AIT Austrian Institute of Technology GmbH
  Donau-City-Strasse 1
  A-1220 Wien