IMPACT at OCR Summit
1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
An Experimental Workflow Development
Platform for Historical Document Digitisation
Clemens Neudecker, KB National Library of the Netherlands
Background
IMPACT – Improving Access to Text (2008 – 2011)
From a technical perspective:
> 20 software components for solving specific issues
Prototyping new algorithms, improving commercial solutions
Different frameworks (C, C++, Java, etc.), platforms (Win/Linux)
+ 3rd party applications
“One ring to rule them all…”
IMPACT Interoperability Framework (IIF)
Main requirements
Behavioural:
Minimize integration effort
Minimize deployment effort
Maximize usability
Maximize scalability
Functional:
Modular
Transparent
Expandable
Open source
Platform independent
Framework integration
Simple-to-use generic wrapper that exposes command line tools as web services
Architecture
IMPACT Interoperability Framework: Technologies
- Java
- Apache Maven
- Apache Tomcat
- Apache Axis2+Synapse
- Taverna Workflow Engine
IMPACT Interoperability Framework: Dataset
- more than 600,000 images from digital libraries
- more than 50,000 ground truth transcriptions
Generic Web Service Wrapper
Only requirement: Command Line Application HTML form
Source code available on GitHub:
https://github.com/impactcentre/toolwrapper
Easy integration: developers can focus on their application
and worry less about integration, which leads to
higher-quality software components
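The wrapping idea above can be illustrated with a minimal sketch. It is not the toolwrapper code from the repository; the names ToolSpec, build_argv and run_tool are hypothetical, and the wrapped "tool" is a stand-in that simply upper-cases its argument. A tool is described once as an argument template, and each incoming request only supplies parameter values.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List


@dataclass
class ToolSpec:
    """Hypothetical description of a wrapped command line tool."""
    name: str
    argv_template: List[str]  # entries may contain placeholders such as {text}


def build_argv(spec: ToolSpec, **params: str) -> List[str]:
    """Fill the placeholders of the template with the request parameters."""
    return [part.format(**params) for part in spec.argv_template]


def run_tool(spec: ToolSpec, **params: str) -> str:
    """Run the wrapped tool and return what it wrote to standard output."""
    result = subprocess.run(
        build_argv(spec, **params),
        capture_output=True,
        text=True,
        check=True,  # a non-zero exit code raises CalledProcessError
    )
    return result.stdout


# Stand-in tool (an assumption for illustration): upper-cases its argument.
UPPER_TOOL = ToolSpec(
    name="upper",
    argv_template=[
        sys.executable,
        "-c",
        "import sys; print(sys.argv[1].upper())",
        "{text}",
    ],
)
```

A service layer would map request fields to the keyword arguments of run_tool, so the tool developer only has to provide the template.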
Workflows
OCR workflow = data pipeline
Building blocks = processing modules
Integration = interaction between nodes (mashups)
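As a minimal sketch of "workflow = data pipeline", processing modules can be modelled as functions that take a document and return an enriched copy, and a workflow as their composition. The module names (binarise, segment, recognise) and the dictionary layout are assumptions for illustration, not components of the framework.

```python
from functools import reduce
from typing import Any, Callable, Dict, List

Document = Dict[str, Any]
Module = Callable[[Document], Document]


def make_module(name: str) -> Module:
    """Create a stand-in module that only records that it has run."""
    def module(doc: Document) -> Document:
        # Return a new dict so that earlier pipeline stages stay untouched.
        return {**doc, "steps": doc.get("steps", []) + [name]}
    return module


def build_workflow(modules: List[Module]) -> Module:
    """Chain modules so that the output of one is the input of the next."""
    def run(doc: Document) -> Document:
        return reduce(lambda current, module: module(current), modules, doc)
    return run


# Hypothetical OCR workflow made of three building blocks.
ocr_workflow = build_workflow(
    [make_module("binarise"), make_module("segment"), make_module("recognise")]
)
```

Calling ocr_workflow({"image": "page.tif"}) returns the document with the three steps recorded in order; swapping or adding a module changes the workflow without touching the other blocks.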
Collaboration with
Workflow management
Web 2.0 style registry: myExperiment
Local client: Taverna Workbench
Web client: Project website
Local client: Taverna Workbench
Background: BioSciences
Developed and maintained by myGrid, UK
Available for Windows/Linux/OSX and as open source (Java)
Web client: Taverna Server/Workflow Parser
SOAP/REST API
Remote execution of workflows
Community
Web 2.0 style workflow registry
Community of experts
Sharing of resources
Knowledge exchange
A central meeting point for users and researchers
Compute cluster
Enterprise Service Bus receives requests from users and
distributes the load to the available worker nodes
Main effects: process parallelization, load distribution, failover
Test deployment on Dutch Supercomputing Cloud HPC
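The three effects named above (parallel processing, load distribution, failover) can be sketched in a few lines. This is not how the Enterprise Service Bus is implemented; dispatch, process_all and the stand-in workers are hypothetical, and worker nodes are modelled as plain functions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

Worker = Callable[[str], str]


class WorkerUnavailable(Exception):
    """Raised by a worker node that cannot accept a job."""


def dispatch(job: str, workers: List[Worker], start: int) -> str:
    """Try the worker at index start first, then fail over to the next ones."""
    for offset in range(len(workers)):
        worker = workers[(start + offset) % len(workers)]
        try:
            return worker(job)
        except WorkerUnavailable:
            continue  # failover: move on to the next node
    raise RuntimeError("no worker available for job " + job)


def process_all(jobs: List[str], workers: List[Worker]) -> List[str]:
    """Run jobs in parallel, spreading the starting node round-robin."""
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        futures = [
            pool.submit(dispatch, job, workers, index)
            for index, job in enumerate(jobs)
        ]
        return [future.result() for future in futures]


def make_worker(name: str) -> Worker:
    """Stand-in worker node that tags the job with its own name."""
    return lambda job: name + ":" + job


def broken_worker(job: str) -> str:
    """Stand-in for a node that is down."""
    raise WorkerUnavailable(job)
```

With the workers [make_worker("node-a"), broken_worker, make_worker("node-c")], the second job starts at the broken node and is transparently retried on node-c.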
Dataset
Representative and annotated dataset of significant size, with
metadata, ground truth and search facilities
Evaluation features
Text-based comparison of result with ground truth,
using the Levenshtein distance method
Layout-based comparison of result with ground truth,
using the Page Analysis and Ground Truth Elements (PAGE) framework
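The text-based comparison can be illustrated with a minimal sketch of the Levenshtein distance: the minimum number of single-character insertions, deletions and substitutions needed to turn the OCR result into the ground truth. The character_accuracy helper is an assumption added for illustration; the slides do not say which score is derived from the distance.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, computed row by row."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string costs i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


def character_accuracy(ocr_text: str, ground_truth: str) -> float:
    """Hypothetical score: 1 minus the distance relative to ground truth length."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    distance = levenshtein(ocr_text, ground_truth)
    return max(0.0, 1.0 - distance / len(ground_truth))
```

For the OCR output "Tbe quick hrown fox" and the ground truth "The quick brown fox", the distance is 2 (two substituted characters) over 19 ground truth characters.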
Example:
Outlook
Online service for testing/evaluation/processing
Results Repository (WebDAV, POI)
Extending the scope:
Workflows for linguistic analysis: CLARIN
Workflows for preservation: SCAPE
Even better scalability: MapReduce/Hadoop
Supported by a community of developers & practitioners
Summary
- Availability of resources (images, ground truth and tools)
to the international research community
- A common baseline for transparent evaluation and comparison
- Ready-to-use components, reproducible experiments
- Sharing of results and know-how
- Scalability for prototypes and data-intensive workflows
- Simple and uniform user interface for all embedded tools
- Consolidation of support and maintenance
Thank you!
Questions?