IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. 
A sustainable infrastructure for large-scale document image analysis
HPC Cloud day – 4 October
Background
• IMPACT – Improving Access to Text (2008 – 2011)
Large-scale integrating research project, funded by the EC 
– Consortium of 26 partners 
– Coordinated by the National Library of the Netherlands (KB) 
– EU funding: € 12 100 000 (FP7 ICT Work Programme) 
– From 2012: sustainable Centre of Competence with alternative resources
• Main objectives:
- Innovate OCR technology 
- Capacity building in mass-digitisation
OCR: A multitude of challenges… 
VVt Venetien den 1.Junij, Anno 1618. 
DJgn i f paffato te S' aö'Jifeert mo?üen/bah .)etgi'uotbciraetail)i.r/JtmelchontDecht te / 
sbnbe bele btr felbrr geiufttceert baer bnber eeniglje jprant o^fen/bie ftcb .met 
beSpaenfcbeu enbeeemgljen bifet Cbeiiupcen berbonbru befe 
OCR: A multitude of challenges… 
• I. OCR challenges (gothic fonts, bleed-through, warping, etc.)
OCR: A multitude of challenges… 
• II. Language challenges (spelling variants, inflection, and many 
more!) 
Example: historical variants of the Dutch word ‘wereld’ (world): 
werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt 
wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels 
zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts 
werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts 
werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
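A minimal sketch of how such variants could be handled in search, assuming a simple lexicon that maps each historical spelling to its modern lemma. The names `VARIANT_TO_LEMMA`, `normalise` and `expand_query` are hypothetical and are not part of the IMPACT lexica; only a handful of the variants listed above are included.

```python
# Hypothetical mini-lexicon: historical spelling -> modern lemma.
# The spellings come from the slide; the lookup scheme is an assumption.
VARIANT_TO_LEMMA = {
    variant: "wereld"
    for variant in ["werelt", "weerelt", "waereld", "vvereld", "werlt", "weireld"]
}


def normalise(token):
    """Return the modern lemma for a known variant, else the token unchanged."""
    return VARIANT_TO_LEMMA.get(token.lower(), token)


def expand_query(lemma):
    """Return the lemma plus every known historical variant, sorted."""
    variants = {v for v, l in VARIANT_TO_LEMMA.items() if l == lemma}
    return sorted(variants | {lemma})
```

With a lexicon like this, a query for ‘wereld’ can be expanded to all known spellings before searching the OCR text, or the OCR text can be normalised at indexing time.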
IMPACT Solutions 
• From a technical perspective: 
More than 20 software toolkits for solving different problems 
• Such as: 
OCR (C++, C#), 
Image Processing & Lexica (DLL), 
Command Line Tools (Win/Linux), 
Java, Ruby, PHP, Perl, etc. 
• IMPACT Interoperability Framework (IIF)
Architecture 
• IMPACT Interoperability Framework: Technologies 
- Java 6 
- Generic Web Service Wrapper 
- Apache Maven 
- Apache Tomcat 
- Apache Axis2 
- Apache Synapse 
- Taverna Workflow Engine 
• IMPACT Interoperability Framework: Dataset 
- PHP/MySQL database, frontend for search 
- approx. 5 TB raw data (images, text files, metadata) and growing
How does it work? 
1. Digitisation/OCR challenges are registered and tagged in the database 
• Warped text 
2. Database contains a 99.95% correct result: the “ground truth”
How does it work? 
3. Researcher develops new method to tackle a problem 
4. Research prototype is wrapped to a SOAP web service 
How does it work? 
5. Web service is integrated as a workflow module 
6. Workflow module can be evaluated, based on the ground truth
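Step 6 can be illustrated with a character-level accuracy measure. This is a sketch only: it assumes the evaluation compares OCR output against the ground-truth text using edit distance, which is a common OCR metric but is not confirmed here as the one the IMPACT evaluation tools use. The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(ocr_text, ground_truth):
    """Share of ground-truth characters reproduced correctly, clamped to [0, 1]."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, ground_truth)
    return max(0.0, 1.0 - errors / len(ground_truth))
```

For example, `character_accuracy("werelt", "wereld")` counts one substitution in six characters, so a workflow module producing that output would score 5/6 on this word.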
Current setup 
• Enterprise Service Bus receives requests from users and distributes the load to the available worker nodes (= servers with all services installed) 
• Main effects: process parallelization, load distribution, failover 
• Drawback: data is sent to worker nodes all around Europe, so a huge amount of data needs to be sent over the net!
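The load distribution and failover behaviour described above can be sketched as a round-robin dispatcher. This is an illustration under assumptions, not the Apache Synapse configuration used in the project: worker nodes are modelled as plain callables, an unreachable node is assumed to raise `ConnectionError`, and the class name `Dispatcher` is hypothetical.

```python
import itertools


class Dispatcher:
    """Round-robin job distribution with failover to the next worker node."""

    def __init__(self, workers):
        if not workers:
            raise ValueError("at least one worker node is required")
        self._workers = list(workers)
        self._next_index = itertools.cycle(range(len(self._workers)))

    def submit(self, job):
        # Try each worker at most once per job, starting where the
        # previous job left off, so load is spread across the nodes.
        for _ in range(len(self._workers)):
            worker = self._workers[next(self._next_index)]
            try:
                return worker(job)
            except ConnectionError:
                continue  # failover: move on to the next node
        raise RuntimeError("no worker node available")
```

Each call to `submit` starts at the node after the one that was tried last, and a job only fails if every node is unreachable.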
Proposed setup 
Set up worker nodes on the HPC cloud (same location) 
Advantage: 
- Improve speed and availability for concurrent users 
- Remove constraints for large-scale processing 
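A rough estimate shows why keeping data and worker nodes in the same location matters. The sketch below computes how long it would take to move the approx. 5 TB dataset once over a network link. The link speeds are assumptions chosen for illustration, protocol overhead is ignored, and in practice only part of the dataset is sent per request.

```python
DATASET_BYTES = 5 * 10**12  # approx. 5 TB (decimal), as stated for the dataset


def transfer_hours(size_bytes, link_mbit_per_s):
    """Hours needed to send size_bytes over a link of the given speed."""
    if link_mbit_per_s <= 0:
        raise ValueError("link speed must be positive")
    bits = size_bytes * 8
    seconds = bits / (link_mbit_per_s * 1_000_000)
    return seconds / 3600


if __name__ == "__main__":
    for speed in (100, 1000):  # assumed link speeds in Mbit/s
        print(f"{speed} Mbit/s: {transfer_hours(DATASET_BYTES, speed):.1f} h")
```

At an assumed 100 Mbit/s the full dataset would need about 111 hours, and about 11 hours at 1 Gbit/s.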
Benefits 
• Scalable platform 
• Availability of resources to a large number of users 
• Enable research into scalable computing 
• Consolidation of support and maintenance 
• Various interfaces (web/local) 


Editor's Notes

  • #4 DDD: Newspaper title: Courante uyt Italien, Duytslandt, &c. Date, edition: 14-06-1618, day edition. Publisher: Caspar van Hilten. Place of publication: Amsterdam. Actual selection from online database.