IMPACT HPC Cloud Day

Scalable and sustainable - OCR & document image analysis in the cloud
New Trends in Humanities Computing. HPC Cloud day, 4 October 2011, Amsterdam, Netherlands.


  1. A sustainable infrastructure for large-scale document image analysis. HPC Cloud day, 4 October. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
  2. Background
     • IMPACT: Improving Access to Text (2008-2011). Large-scale integrating research project, funded by the EC
       - Consortium of 26 partners
       - Coordinated by the National Library of the Netherlands (KB)
       - EU funding: € 12 100 000 (FP7 ICT Work Programme)
       - From 2012: sustainable Centre of Competence with alternative resources
     • Main objectives:
       - Innovate OCR technology
       - Capacity building in mass-digitisation
  3. OCR: A multitude of challenges…
     VVt Venetien den 1.Junij, Anno 1618. DJgn i f paffato te S' aö'Jifeert mo?üen/bah .)etgi'uotbciraetail)i.r/JtmelchontDecht te / sbnbe bele btr felbrr geiufttceert baer bnber eeniglje jprant o^fen/bie ftcb .met beSpaenfcbeu enbeeemgljen bifet Cbeiiupcen berbonbru befe
  4. OCR: A multitude of challenges…
     • I. OCR challenges (gothic fonts, bleed-through, warping, etc.)
  5. OCR: A multitude of challenges…
     • II. Language challenges (spelling variants, inflection, and many more!)
     Example: historical variants of the Dutch word ‘wereld’ (world):
     werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
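One common way to cope with spelling variation of this kind is a lexicon that maps each attested historical form to its modern lemma. The following is a minimal sketch of that idea, not the IMPACT lexicon tooling itself: the `VARIANTS` table, `build_index` and `normalise` are hypothetical names, and the table holds only a handful of the variants listed on the slide.

```python
# Hypothetical mini-lexicon: modern lemma -> a few historical spellings
# taken from the slide. A real lexicon would be far larger.
VARIANTS = {
    "wereld": ["werelt", "weerelt", "waereld", "vvereld", "werlt", "weereld"],
}


def build_index(lexicon):
    """Invert lemma -> variants into variant -> lemma for direct lookup."""
    index = {}
    for lemma, variants in lexicon.items():
        index[lemma] = lemma
        for variant in variants:
            index[variant.lower()] = lemma
    return index


def normalise(tokens, index):
    """Replace known historical variants by their lemma; keep other tokens."""
    return [index.get(token.lower(), token) for token in tokens]


INDEX = build_index(VARIANTS)
print(normalise(["de", "Waereld"], INDEX))
```

A lookup table only covers variants that have already been recorded; unseen spellings pass through unchanged, which is why such lexica are usually combined with other matching techniques.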
  6. IMPACT Solutions
     • From a technical perspective: more than 20 software toolkits for solving different problems
     • Such as: OCR (C++, C#), Image Processing & Lexica (DLL), Command Line Tools (Win/Linux), Java, Ruby, PHP, Perl, etc.
     • IMPACT Interoperability Framework (IIF)
  7. Architecture
     • IMPACT Interoperability Framework: Technologies
       - Java 6
       - Generic Web Service Wrapper
       - Apache Maven
       - Apache Tomcat
       - Apache Axis2
       - Apache Synapse
       - Taverna Workflow Engine
     • IMPACT Interoperability Framework: Dataset
       - PHP/MySQL database, frontend for search
       - approx. 5 TB raw data (images, text files, metadata) and growing
  8. How does it work?
     1. Digitisation/OCR challenges are registered and tagged in the database
        • Example: warped text
     2. The database contains a 99.95% correct result: the “ground truth”
  9. How does it work?
     3. A researcher develops a new method to tackle a problem
     4. The research prototype is wrapped as a SOAP web service
  10. How does it work?
     5. Web service is integrated as a workflow module
     6. Workflow module can be evaluated, based on the ground truth
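The evaluation step can be illustrated by comparing a module's OCR output with the ground-truth text. The slides do not say which metric is used, so the sketch below assumes a simple character accuracy based on Levenshtein edit distance; `edit_distance` and `character_accuracy` are hypothetical helper names.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                     # delete char_a
                current[j - 1] + 1,                  # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr_text, ground_truth):
    """Assumed metric: 1 - (edit distance / ground-truth length), floored at 0."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, ground_truth)
    return max(0.0, 1.0 - errors / len(ground_truth))


# One wrong character out of six in the ground truth.
print(round(character_accuracy("werelt", "wereld"), 3))
```

Under this metric, a ground truth that is 99.95% correct corresponds to at most 5 character errors per 10 000 characters.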
  11. Current setup
     • Enterprise Service Bus receives requests from users and distributes the load to the available worker nodes (= server with all services installed)
     • Main effect: process parallelization, load distribution, failover
     • Drawback: data is sent to worker nodes all around Europe, so a huge amount of data needs to be sent over the net!
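The load distribution and failover behaviour described for the service bus can be sketched as a round-robin dispatcher that skips unavailable nodes. This is an illustration only, not how Apache Synapse is configured in the project: the `Dispatcher` class, the node names and the use of `ConnectionError` to signal an unreachable node are all assumptions.

```python
import itertools


class Dispatcher:
    """Hypothetical round-robin dispatcher with failover across worker nodes."""

    def __init__(self, workers):
        # workers: dict mapping node name -> callable that processes a request
        self._workers = dict(workers)
        self._cycle = itertools.cycle(list(self._workers))

    def submit(self, request):
        # Try each node at most once, starting from the next one in rotation.
        last_error = None
        for _ in range(len(self._workers)):
            name = next(self._cycle)
            try:
                return name, self._workers[name](request)
            except ConnectionError as error:
                last_error = error  # node unavailable: fail over to the next
        raise RuntimeError("no worker node available") from last_error


def healthy(request):
    return request.upper()


def broken(request):
    raise ConnectionError("node down")


dispatcher = Dispatcher({"node-a": broken, "node-b": healthy})
print(dispatcher.submit("page-1"))
```

The sketch passes only a short string as the request. In the setup described on the slide, each request carries image data to a remote node, which is where the network cost arises.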
  12. Proposed setup
     • Set up worker nodes on the HPC cloud (same location)
     • Advantage:
       - Improve speed and availability for concurrent users
       - Remove constraints for large-scale processing
  13. Benefits
     • Scalable platform
     • Availability of resources to a large number of users
     • Enable research into scalable computing
     • Consolidation of support and maintenance
     • Various interfaces (web/local)
