IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. 
An Experimental Workflow Development Platform for Historical Document Digitisation
Clemens Neudecker, KB National Library of the Netherlands
Background
• IMPACT – Improving Access to Text (2008 – 2011)
From a technical perspective:
- More than 20 software components for solving specific issues
- Prototyping new algorithms, improving commercial solutions
- Different languages and frameworks (C, C++, Java, etc.) and platforms (Windows/Linux), plus third-party applications
“One ring to rule them all…”
→ IMPACT Interoperability Framework (IIF)
Main requirements 
Behavioural:
• Minimize integration effort
• Minimize deployment effort
• Maximize usability
• Maximize scalability
Functional:
• Modular
• Transparent
• Expandable
• Open source
• Platform independent
Framework integration
• Simple-to-use generic wrapper that exposes command line tools as web services
Architecture
• IMPACT Interoperability Framework: Technologies
- Java
- Apache Maven
- Apache Tomcat
- Apache Axis2 + Synapse
- Taverna Workflow Engine
• IMPACT Interoperability Framework: Dataset
- more than 600,000 images from digital libraries
- more than 50,000 ground truth transcriptions
Generic Web Service Wrapper 
Only requirement: Command Line Application ↔ HTML form
Source code available on GitHub:
https://github.com/impactcentre/toolwrapper
→ Easy integration: developers can focus on their application and have to worry less about integration = higher quality software components
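The wrapping idea can be illustrated with a minimal Python sketch, assuming a tool is described by a command line template with named placeholders. This is not the toolwrapper's own code: `ToolSpec`, `build_command` and `run_tool` are hypothetical names, and the "tool" is simply the Python interpreter so that the sketch runs anywhere.

```python
import shlex
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ToolSpec:
    """Hypothetical description of a command line tool to expose as a service."""
    name: str
    command_template: str  # e.g. "mytool --in {input} --out {output}"


def build_command(spec: ToolSpec, params: dict) -> list:
    """Fill the template with request parameters and split it into argv form."""
    # Quote each value so spaces or shell metacharacters stay inside one argument.
    quoted = {key: shlex.quote(str(value)) for key, value in params.items()}
    return shlex.split(spec.command_template.format(**quoted))


def run_tool(spec: ToolSpec, params: dict) -> dict:
    """Run the tool and collect what a service response could carry."""
    completed = subprocess.run(
        build_command(spec, params), capture_output=True, text=True, check=False
    )
    return {
        "tool": spec.name,
        "exit_code": completed.returncode,
        "stdout": completed.stdout,
        "stderr": completed.stderr,
    }


# Stand-in tool (assumption): the Python interpreter itself.
echo_spec = ToolSpec(
    name="echo-demo",
    command_template=shlex.quote(sys.executable) + " -c {code}",
)
response = run_tool(echo_spec, {"code": "print('hello from the wrapped tool')"})
print(response["exit_code"], response["stdout"].strip())
```

In this sketch the tool author supplies only the template; everything service-related (parameter handling, execution, response building) lives in the generic part, which is the division of labour the slide describes.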
Workflows
• OCR workflow = data pipeline
• Building blocks = processing modules
• Integration = interaction between nodes (mashups)
• Collaboration with
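The "workflow = data pipeline" view can be sketched as a chain of processing modules, each taking a document and returning an updated one. The module names (`binarise`, `segment`, `recognise`) are hypothetical placeholders that only record that they ran; they are not IMPACT components.

```python
from functools import reduce


def make_module(name):
    """Build a placeholder processing module that records its own name."""
    def module(doc):
        # A real module would transform the image or text; this one only logs the step.
        return {**doc, "steps": doc["steps"] + [name]}
    return module


def run_workflow(modules, doc):
    """Pass the document through each module in order."""
    return reduce(lambda current, module: module(current), modules, doc)


binarise = make_module("binarise")
segment = make_module("segment")
recognise = make_module("recognise")

page = {"image": "page-001.tif", "steps": []}
result = run_workflow([binarise, segment, recognise], page)
print(result["steps"])  # ['binarise', 'segment', 'recognise']
```

Because every module shares the same input and output shape, blocks can be reordered or swapped without touching the others.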
Workflow management
• Web 2.0-style registry: myExperiment
• Local client: Taverna Workbench
• Web client: Project website
Local client: Taverna Workbench
• Background: BioSciences
• Developed and maintained by myGrid, UK
• Available for Windows/Linux/OSX and as open source (Java)
Web client: Taverna Server/Workflow Parser
• SOAP/REST API
• Remote execution of workflows
Community
• Web 2.0-style workflow registry
• Community of experts
• Sharing of resources
• Knowledge exchange
• A central meeting point for users and researchers
Compute cluster
• Enterprise Service Bus receives requests from users and distributes the load to the available worker nodes
• Main effect: process parallelization, load distribution, failover
• Test deployment on Dutch Supercomputing Cloud HPC
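Load distribution with failover can be sketched as a round-robin dispatcher that skips nodes which refuse a request. This is an illustration only, not how the Enterprise Service Bus is configured; `Dispatcher`, `WorkerUnavailable` and the worker functions are hypothetical.

```python
from itertools import cycle


class WorkerUnavailable(Exception):
    """Raised by a worker node that cannot take the request."""


class Dispatcher:
    """Round-robin dispatcher that fails over to the next node on error."""

    def __init__(self, workers):
        self.workers = list(workers)
        self._next = cycle(range(len(self.workers)))

    def dispatch(self, request):
        # Try each worker at most once, starting from the next one in rotation.
        for _ in range(len(self.workers)):
            worker = self.workers[next(self._next)]
            try:
                return worker(request)
            except WorkerUnavailable:
                continue  # fail over to the next node
        raise RuntimeError("no worker node available")


def healthy(name):
    """Build a worker that always succeeds."""
    return lambda request: f"{name} processed {request}"


def broken(request):
    """A worker that is always down."""
    raise WorkerUnavailable()


dispatcher = Dispatcher([healthy("node-a"), broken, healthy("node-c")])
results = [dispatcher.dispatch(f"page-{i}") for i in range(1, 4)]
print(results)
```

The second request lands on the broken node, so the dispatcher hands it to `node-c` instead; the caller never sees the failure.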
Dataset
• Representative and annotated dataset of significant size, with metadata, ground truth and search facilities
Evaluation features
• Text-based comparison of result with ground truth, using the Levenshtein distance method
• Layout-based comparison of result with ground truth, using the Page Analysis and Ground Truth Elements (PAGE) framework
 Example:
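To make the text-based comparison concrete, here is a minimal Python sketch of the Levenshtein distance between an OCR result and its ground truth. It is not the project's evaluation tool, and the `character_accuracy` normalisation is an assumption: the slides do not say how the distance is turned into a score.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr_text: str, ground_truth: str) -> float:
    """Assumed normalisation: 1 - distance / length of ground truth, floored at 0."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    return max(0.0, 1.0 - levenshtein(ocr_text, ground_truth) / len(ground_truth))


ground_truth = "historical"
ocr_text = "hiftorical"  # a long s misread as f, a typical historical OCR error
print(levenshtein(ocr_text, ground_truth))
print(character_accuracy(ocr_text, ground_truth))
```

The two-row formulation keeps memory proportional to the length of one string, which matters when whole pages are compared.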
Outlook
• Online service for testing/evaluation/processing
- Results Repository (WebDAV, POI)
• Extending the scope:
- Workflows for linguistic analysis: CLARIN
- Workflows for preservation: SCAPE
• Even better scalability: MapReduce/Hadoop
• Supported by a community of developers & practitioners
Summary 
- Availability of resources (images, ground truth and tools) to the international research community
- A common baseline for transparent evaluation and comparison 
- Ready-to-use components, reproducible experiments 
- Sharing of results and know-how 
- Enable scalability for prototypes/data intensive workflows 
- Simple and uniform user interface for all embedded tools 
- Consolidation of support and maintenance 
Thank you! 
Questions?

IMPACT at OCR Summit
