Evolving Domains, Problems and Solutions for
       Long Term Digital Preservation

                      Dr. Ross King
         AIT Austrian Institute of Technology GmbH
Co-Authors
•   Orit Edelstein – IBM Research, Haifa
•   Michael Factor – IBM Research, Haifa
•   Thomas Risse – L3S Research Center, Hannover
•   Eliot Salant – IBM Research, Haifa
•   Philip Taylor – SAP Research, Belfast
Outline
• Why these projects?
• Introducing the projects
• Comparing and contrasting the projects
  – Motivation
  – Objectives
  – Approach
• Trends in Digital Preservation
Why these projects?
Timeline of Digital Preservation Projects

[Figure: timeline of digital preservation projects, from http://cordis.europa.eu/fp7/ict/telearn-digicult/report-research-digital-preservation_en.pdf]
Legend: Coordinated Action, Network of Excellence, STREP, Collaborative Project
FP7 6th Call, Objective ICT-2009.4.1: Digital Libraries and Digital Preservation

07.11.2011
EU Funding for Digital Preservation Projects
from http://cordis.europa.eu/fp7/ict/telearn-digicult/report-research-digital-preservation_en.pdf

Framework Programme    Funding
FP7                    68.4 M€
FP6                    24.9 M€
FP5                     0.9 M€
Introducing the projects
ARCOMEM
•   Transforming Web archives into community memories that are much more
    tightly integrated with their community of current and future users.
•   Developing methods and tools based on novel socially-aware and socially-
    driven Web preservation models.
•   Three dimensions
     –   Social Web analysis: leverage Social Web information, relying on the Wisdom of the
         Crowds for intelligent content appraisal, selection, contextualization and preservation.
     –   Archive enrichment: extract information about entities, events, topics, and opinions.
     –   Intelligent and collaborative content acquisition support for archives


•   Two testbeds
     –   Media-related web archives
         (Südwestrundfunk, Deutsche Welle)
     –   Political archives
         (Hellenic and Austrian Parliaments)
ENSURE
Enabling kNowledge Sustainability, Usability and Recovery for Economic value
• EVALUATE Cost and Value
      •   Ability to compose different quality solutions at different costs
      •   Build a software stack that balances the cost of preservation against the value of the data
•   AUTOMATE Preservation Lifecycle
      •   Control the preservation lifecycle based on
            • the changing value of business data over time
            • changes in regulation
            • advances in underlying technology
•   PROTECT
      •   Content-aware data protection
            • Focus on long term access control, privacy and IPR,
              and de-identification
•   SCALE using ICT innovations
      •   Investigate economical and scalable solutions
          such as cloud storage
            • include issues of security and data locality

[Figure: 4 INNOVATIONS, 3 USE CASES: Healthcare, Clinical Studies, Financial Services]
•   Three testbeds
      •   Healthcare
      •   Clinical Trials
      •   Financial Services
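
The cost-versus-value balancing described for ENSURE can be illustrated with a minimal sketch. Everything here is an assumption made for illustration only: the `StorageOption` type, the option names, the cost figures and the quality scores are hypothetical and are not taken from the ENSURE project.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class StorageOption:
    """Hypothetical preservation configuration (not an ENSURE data type)."""
    name: str
    annual_cost_per_tb: float  # illustrative cost in EUR
    quality: int               # illustrative quality score, higher is better


def choose_option(options: Iterable[StorageOption],
                  data_value_per_tb: float,
                  min_quality: int) -> Optional[StorageOption]:
    """Return the cheapest option that meets the quality floor and whose
    cost does not exceed the value of the data; None if nothing qualifies."""
    candidates = [
        o for o in options
        if o.quality >= min_quality and o.annual_cost_per_tb <= data_value_per_tb
    ]
    return min(candidates, key=lambda o: o.annual_cost_per_tb, default=None)


# Illustrative catalogue of solutions at different quality and cost levels.
OPTIONS = [
    StorageOption("single-copy cloud", 50.0, 1),
    StorageOption("replicated cloud", 120.0, 2),
    StorageOption("replicated cloud + audit", 300.0, 3),
]
```

Re-running `choose_option` as the estimated value of the data changes over time is one simple way to picture a lifecycle decision driven by cost and value.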
SCAPE
SCAlable Preservation Environments
• Making preservation planning and preservation
  workflows scalable
   – Define and test an infrastructure for scalable
     preservation actions
   – Provide a framework for automated quality assurance
     workflows
   – Develop a policy-based preservation planning tool with
     automated preservation watch

• Three testbeds
   – Web archives
   – Large-scale repositories
   – Research data sets

                                                      from digitalbevaring.dk
TIMBUS
Timeless Business Processes and Services
• Exploring scenarios where the important digital information to be preserved is the
   execution context within which data are processed, analysed, transformed and
   rendered.
     –   Although there are significant advantages to SaaS and IoS models, there is the danger of services and
         service providers disappearing (for various reasons), leaving partially complete business processes.
•   Enlarging the understanding of digital preservation to include the set of activities,
    processes and tools that ensure continued access to services and software necessary
    to produce the context within which information can be accessed, properly rendered,
    validated and transformed into context-based knowledge.
•   Three testbeds
     – engineering services and systems
       for digital preservation
     – civil engineering infrastructures
     – e-science and mathematical simulations
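
The danger of disappearing services can be pictured as a question about a dependency graph: which components does a business process rely on, directly or indirectly, and which of them are outside the organisation's control? The sketch below is only an illustration under stated assumptions; the graph, the service names and the `EXTERNAL` set are hypothetical and do not come from TIMBUS.

```python
def transitive_dependencies(graph, root):
    """Collect every component reachable from `root` in a dependency graph
    given as {component: [direct dependencies]}."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for dep in graph.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def at_risk(graph, root, external):
    """Return the sorted dependencies of `root` that are externally provided."""
    return sorted(transitive_dependencies(graph, root) & external)


# Hypothetical business process and its execution context.
GRAPH = {
    "invoice-process": ["billing-service", "pdf-renderer"],
    "billing-service": ["external-tax-api", "database"],
    "pdf-renderer": ["font-library"],
}
EXTERNAL = {"external-tax-api"}  # services run by a third-party provider
```

Here `at_risk(GRAPH, "invoice-process", EXTERNAL)` lists the third-party services whose disappearance would leave the process partially complete.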
Comparing and contrasting
      the projects
Motivation
• ARCOMEM is unique in dealing with publicly available and non-regulated
  data and in harnessing the "wisdom of crowds" to help decide what to
  preserve.
• TIMBUS focuses on the environments that produce the data rather than
  the data itself.
• ENSURE and TIMBUS are motivated in part by accurate risk assessment
  and preservation lifecycle issues related to regulations.
• ENSURE, SCAPE and TIMBUS address the scalability of technology and
  software infrastructure for digital preservation.

• Targeted Stakeholders:
    –   scientific data (SCAPE, ENSURE, TIMBUS)
    –   memory institutions (SCAPE, ARCOMEM)
    –   web (SCAPE, ARCOMEM)
    –   engineering (TIMBUS)
    –   health care (ENSURE)
    –   finance (ENSURE)
Objectives
• ENSURE, SCAPE, and TIMBUS are focused on organisations
  (organisation-focused projects); ARCOMEM is focused on the web
• All projects address the question "what is to be preserved"
    –   ARCOMEM: social media can tell us
    –   ENSURE: extract this information from business rules
    –   SCAPE and TIMBUS: provide tools for responsible persons (curators)
    –   TIMBUS driven by risk management, ENSURE by cost/benefit
• ARCOMEM, ENSURE and SCAPE focus on issues of scalability
    – ARCOMEM, SCAPE: computational
    – ENSURE: storage infrastructure
• The organisation-focused projects also consider
    – the automation of the preservation lifecycle
    – the automation of quality assurance for preservation actions
• Both ENSURE and TIMBUS have the goal of re-running software after long
  periods of time
Approach
•   All four projects will produce prototype software frameworks
     –   The organisation-focused projects all propose to implement platforms for the execution of
         preservation workflows
•   SCAPE and ENSURE will make use of service-oriented architectures
     –   SCAPE for prototyping only; SOA model workflows should be translated into Map/Reduce jobs
•   Digital Lifecycle approach
     –   TIMBUS focuses on the legal and IPR aspects
     –   ENSURE focuses on the trade-offs between quality, cost and economic performance
•   Preservation planning plays a role in all projects
     –   ENSURE plans a configuration layer with special emphasis on cost versus value
     –   The TIMBUS approach is based on dependency and risk management
     –   Both ARCOMEM and SCAPE rely on the internet to guide preservation
           •   ARCOMEM through the monitoring of social media
           •   SCAPE through the monitoring of web harvests

•   Virtualisation plays a role in all organisation-focused projects
     –   ENSURE: as a means to access digital objects
     –   SCAPE: as a means to deploy complex preservation action environments
     –   TIMBUS: as a means to preserve and recover the entire business process
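
To make the Map/Reduce point on this slide concrete, here is a minimal in-memory sketch of a preservation-style task (counting files per format) expressed as map, shuffle and reduce phases. It is an assumption-laden illustration, not SCAPE code: the function names are hypothetical, file extensions stand in for real format identification, and a real job would run on a Map/Reduce platform rather than in a single process.

```python
import os
from collections import defaultdict


def map_phase(path):
    """Map step: emit a (format, 1) pair for one file path.
    The file extension is used as a stand-in for format identification."""
    ext = os.path.splitext(path)[1].lower() or "(none)"
    return (ext, 1)


def shuffle(pairs):
    """Shuffle step: group emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: sum the counts for each format."""
    return {key: sum(values) for key, values in groups.items()}


def count_formats(paths):
    """Run the three phases over a list of file paths."""
    return reduce_phase(shuffle(map(map_phase, paths)))
```

Because each file is mapped independently, the map step can be spread across many machines, which is the property that makes this style attractive for large collections.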
Some trends
in Digital Preservation
Trends in Digital Preservation Projects
[Figure: timeline from 2006 to 2012 of technology trends in digital preservation projects. Trend bands: CONTENT-DRIVEN, SEMANTIC WEB, EMULATION, WORKFLOW, WEB SERVICES, GRID, CLOUD. Labels within the figure: PANIC, Semantic Web Services, Semantic Web Services + Agents, Linked Open Data, SOA: Web Services, Distributed Storage, Distributed Processing, Workflow, Virtualization, Security and Trust, Quality Assurance.]
Thank you for your attention!
            Ross King – AIT, Vienna
     Orit Edelstein – IBM Research, Haifa
     Michael Factor – IBM Research, Haifa
 Thomas Risse – L3S Research Center, Hannover
      Eliot Salant – IBM Research, Haifa
     Philip Taylor – SAP Research, Belfast

      ARCOMEM:     www.arcomem.eu
      ENSURE:      ensure-fp7.eu
      SCAPE:       www.scape-project.eu
      TIMBUS:      timbusproject.net
