Experimental Workflow Development in Digitisation
2nd Qualitative and Quantitative Methods in Libraries International Conference (QQML2010), 25-28 May 2010, Chania, Greece.
1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
Experimental workflow development in digitisation
The concept of collaborative workflow development in the IMPACT project
Mustafa Dogan (Göttingen State and University Library)
Clemens Neudecker (Koninklijke Bibliotheek)
Gerd Zechmeister (Austrian National Library)
Sven Schlarb (Austrian National Library)
27.5.2010 QQML
Agenda
Background of IMPACT
Digitisation workflows
Collaborative workflow development
Architectural principles
Workflow development platform
Key success factors
Outlook and future scenarios
Background of IMPACT
Project partners
– 26 Libraries, Research Institutes and Industry Partners
Main objective
– Improve access to historical books and newspapers printed before 1900
Software tools and prototypes
– Image Enhancement & Segmentation Toolkit
– Improved ABBYY FineReader OCR Engine, IBM Adaptive OCR
– Post-processing and post-correction modules
– Lexical resources for several European languages
Support to the MLA community
– Best Practices & Strategic/Operational Guidelines
– Online Helpdesk
– Tool Showcases & Demonstrators
– Centre of Competence
Digitisation workflows
Digitisation: a sequence of steps, from selection of analogue source
material to presentation of digital objects for end-users
Workflow: software-based execution of a sequence without human
interaction
Challenges and barriers
– Workflows are tailored to specific needs
– Lack of interoperability for applied software and input/output data
– Lack of collaboratively used and developed resources and expertise
Collaborative workflow development
Workflow development as a community-driven activity using an
experimental platform
Scientific workflows: using web services representing individual
software modules (Shiyong Lu et al. 2009)
Providing highly innovative and efficient tools to a wider community to
design workflows
Technical staff providing the platform, conceptual/library staff
designing workflows
Using Web 2.0 features to share and expand knowledge and
resources
Architectural platform principles
Modularity
Transparency
Flexibility
Extensibility
Open standards based
Accessibility
Scalability
Collaboration
Workflow development platform
Workflow development phases
Evaluation criteria
OCR: correctly recognised characters/words
Segmentation: correctly identified text and graphical regions
Workflows: comparing workflows and identifying the most suitable
Statistical and provenance data: e.g. processing time
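The OCR criterion above (correctly recognised characters/words) can be illustrated with a small sketch. The slides do not state how IMPACT computes these figures; the version below assumes a common approach, accuracy derived from the edit distance between a ground-truth transcription and the OCR output. The function names and the sample strings are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def accuracy(ground_truth, ocr_output):
    """Assumed metric: 1 - (edit distance / ground-truth length), floored at 0."""
    if not ground_truth:
        return 1.0 if not ocr_output else 0.0
    errors = levenshtein(ground_truth, ocr_output)
    return max(0.0, 1.0 - errors / len(ground_truth))


# Invented sample strings, for illustration only.
truth = "the quick brown fox"
ocr = "the qu1ck brown f0x"

char_accuracy = accuracy(truth, ocr)                  # compares characters
word_accuracy = accuracy(truth.split(), ocr.split())  # compares whole words
```

The same function serves both levels because it only compares sequence items: characters when given strings, words when given lists.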
Outlook
Keys to success
– Joint effort by library and software development staff
– Usability of tools and platform
– Incentives for collaborative work
– Testing and adaptation of workflows
– Continuously tailoring and optimising workflows
Future work
– Demonstration of current (web) services
– Experimental platform as sustainable resource for a Centre of
Competence for the MLA community
Thank you very much!
Contact:
Project Website: http://www.impact-project.eu
Project Office: impact@kb.nl
Editor's Notes
26 Libraries, Research Institutes and Industry Partners: providing content/material, knowledge/expertise, tools/software modules/prototypes
What is digitisation in IMPACT
How do we define a workflow in IMPACT
What are the challenges in current workflow development and application
Workflows are tailored to library-/project-specific needs
no out-of-the-box system
causes labour- and cost-intensive evaluation and adaptation for repurposing
Lack of interoperability for applied software and input/output data
Lack of collaboratively used and developed resources and expertise
Human intervention often required to guarantee ongoing processing
Concept of scientific workflows: http://www.cs.wayne.edu/~shiyong/papers/tsc09.pdf
Technical staff provide the platform, conceptual staff design the workflows; no in-depth technical and procedural knowledge is required of conceptual staff
Modularity:
modules can be combined in a number of combinations
identify the most suitable processing chain
service-oriented architecture (SOA) is the guiding architectural design principle
principle of loose coupling of reusable processing units
minimising interdependencies
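The modularity idea above (combine modules in several ways, then identify the most suitable processing chain) might be sketched as follows. All module names and score values are hypothetical placeholders, not IMPACT tools or measured results; in practice the score function would run the chain on sample pages and evaluate the output.

```python
from itertools import product

# Hypothetical alternatives for each workflow step (placeholder names).
STEPS = {
    "enhancement": ["none", "binarise"],
    "segmentation": ["seg_a", "seg_b"],
    "ocr": ["engine_x", "engine_y"],
}


def candidate_chains(steps):
    """Yield every chain that picks exactly one module per step."""
    names = list(steps)
    for combo in product(*(steps[name] for name in names)):
        yield dict(zip(names, combo))


def best_chain(steps, score):
    """Return the chain with the highest score."""
    return max(candidate_chains(steps), key=score)


# Made-up bonuses standing in for a real evaluation run.
TOY_BONUS = {"binarise": 0.03, "seg_b": 0.02, "engine_y": 0.05}


def toy_score(chain):
    return 0.80 + sum(TOY_BONUS.get(module, 0.0) for module in chain.values())


chains = list(candidate_chains(STEPS))
winner = best_chain(STEPS, toy_score)
```

Because each module is only referenced by name, adding an alternative to one step changes the set of candidate chains without touching the other steps, which is the loose coupling the notes describe.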
Transparency:
Each processing step tested and evaluated separately
Flexibility:
platform-independent
capable of integrating different types of software
performance of tools can be compared easily.
Extensibility:
Third-party components can be integrated with small extra effort
not restricted to software tools developed in IMPACT
Open standards based:
widely supported open source software (Apache Software Foundation)
Interoperability through use of XML standards such as
METS/ALTO for encoding of structural information and the OCR-recognised text
SOAP as the message exchange protocol
WSDL for web service description
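As an illustration of how an XML standard supports interoperability, the sketch below reads recognised text from an ALTO document using only the Python standard library. The sample fragment is invented and heavily simplified (real ALTO files also carry coordinates, styles and metadata), and the helper names are hypothetical. Elements are matched by local name so the code does not depend on one particular ALTO namespace version.

```python
import xml.etree.ElementTree as ET

# Invented, minimal ALTO fragment for illustration only.
ALTO_SAMPLE = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Historical"/>
            <SP/>
            <String CONTENT="newspaper"/>
          </TextLine>
          <TextLine>
            <String CONTENT="text"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tags."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return the recognised text, one string per TextLine element."""
    root = ET.fromstring(xml_text.strip())
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [child.get("CONTENT", "")
                     for child in element
                     if local_name(child.tag) == "String"]
            lines.append(" ".join(words))
    return lines


lines = alto_lines(ALTO_SAMPLE)
```

Any tool that writes its result in this structure can be followed by any tool that reads it, regardless of which OCR engine produced the text.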
Accessibility:
3 different types of interfaces
user-friendly, graphical workflow design and execution interface
a web client generator for seamless integration into web sites
machine interface (API)
Scalability:
Components will be deployed in the IT infrastructure of different partner organisations in a distributed network with cloned services
Services available in a redundant way
Balancing the workload and adding additional computing capacity when needed.
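The notes do not say how workload balancing across cloned services is implemented. As one possible reading, the sketch below rotates jobs over a list of service clones and skips a clone that fails to respond. The class, the endpoint labels and the `fake_invoke` stand-in are all hypothetical; a real client would call the web service at that point.

```python
class RoundRobinDispatcher:
    """Rotate jobs over cloned service endpoints, skipping unreachable ones."""

    def __init__(self, endpoints):
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._endpoints = list(endpoints)
        self._next = 0

    def add(self, endpoint):
        """Add computing capacity by registering another clone."""
        self._endpoints.append(endpoint)

    def call(self, job, invoke):
        """Try each clone at most once, starting with the next in rotation."""
        count = len(self._endpoints)
        last_error = None
        for offset in range(count):
            index = (self._next + offset) % count
            endpoint = self._endpoints[index]
            try:
                result = invoke(endpoint, job)
            except ConnectionError as exc:
                last_error = exc
                continue
            self._next = (index + 1) % count
            return endpoint, result
        raise RuntimeError("no service clone available") from last_error


def fake_invoke(endpoint, job):
    """Stand-in for a real service call; clone 'B' is simulated as down."""
    if endpoint == "B":
        raise ConnectionError("clone B unreachable")
    return f"{job} processed by {endpoint}"


dispatcher = RoundRobinDispatcher(["A", "B", "C"])
used = [dispatcher.call(page, fake_invoke)[0] for page in ["p1", "p2", "p3"]]
```

Redundancy shows up in the failover loop and scalability in `add`: a new clone joins the rotation without any change to the calling workflow.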
Collaboration:
community-wide applicability
optimisation of workflows
accessible by various channels (including Web 2.0 features)
comprehensively described and documented.
Joint effort by library and software development staff: library: concepts, content-providing – SD: technical framework, integration of services etc.
Expanding portfolio of web services: also by scanning services, quality assurance/evaluation modules etc. to cover entire range of digitisation workflow steps