IMPACT at OCR Summit

OCR Summit Meeting
Initiative for Digital Humanities, Media and Culture, Texas A&M University, 17-18 October 2011, College Station, TX, United States.

  1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. An Experimental Workflow Development Platform for Historical Document Digitisation. Clemens Neudecker, KB National Library of the Netherlands.
  2. Background. IMPACT: Improving Access to Text (2008-2011). From a technical perspective: more than 20 software components for solving specific issues; prototyping new algorithms and improving commercial solutions; different frameworks (C, C++, Java, etc.) and platforms (Win/Linux), plus 3rd party applications. “One ring to rule them all…”: the IMPACT Interoperability Framework (IIF).
  3. Main requirements. Behavioural: minimize integration effort; minimize deployment effort; maximize usability; maximize scalability. Functional: modular; transparent; expandable; open source; platform independent.
  4. Framework integration. Simple-to-use generic command line wrapper for web services.
  5. Architecture. IMPACT Interoperability Framework, technologies: Java; Apache Maven; Apache Tomcat; Apache Axis2 + Synapse; Taverna Workflow Engine. IMPACT Interoperability Framework, dataset: more than 600,000 images from digital libraries; more than 50,000 ground truth transcriptions.
  6. Generic Web Service Wrapper. Only requirement: a command line application. HTML form. Source code available on GitHub (https://github.com/impactcentre/toolwrapper). Easy integration: developers can focus on their application and worry less about integration, which leads to higher quality software components.
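
The idea behind slide 6, that a tool only needs a command line interface to be integrated, can be illustrated with a minimal sketch. This is not the code of the IMPACT toolwrapper; the template syntax, the function names `build_command` and `run_tool`, and the tool name `ocr-tool` are assumptions made purely for illustration.

```python
import shlex
import subprocess
import sys
from typing import List


def build_command(template: str, **params: str) -> List[str]:
    """Turn a command line template into an argument list.

    The template is split into tokens first and placeholders are filled in
    afterwards, so a value containing spaces stays a single argument.
    """
    return [token.format(**params) for token in shlex.split(template)]


def run_tool(template: str, **params: str) -> str:
    """Run the wrapped command line tool and return what it wrote to stdout."""
    result = subprocess.run(
        build_command(template, **params),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


# Hypothetical tool description: "ocr-tool" is a made-up name.
print(build_command("ocr-tool --lang {lang} {input}", lang="nld", input="page 1.tif"))

# Runnable demonstration that uses the Python interpreter as the wrapped tool.
print(run_tool("{python} -c {code}", python=sys.executable, code="print('hello')"))
```

A service layer (for example a web form or a SOAP endpoint) would then only need to collect the parameter values and call `run_tool`, which is why the tool author has little integration work to do.
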
  7. Workflows. OCR workflow = data pipeline; building blocks = processing modules; integration = interaction between nodes (mashups). Collaboration with
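
The view of a workflow as a data pipeline built from processing modules can be sketched in a few lines. The two modules below are hypothetical text-based stand-ins; real modules in an OCR workflow would be steps such as image enhancement, segmentation or recognition, and the actual workflows in the talk are modelled in Taverna, not in Python.

```python
from functools import reduce
from typing import Callable, Sequence

# A processing module takes the current data and returns the transformed data.
Module = Callable[[str], str]


def run_workflow(modules: Sequence[Module], document: str) -> str:
    """Pass the document through each module in order, feeding each output to the next."""
    return reduce(lambda data, module: module(data), modules, document)


# Hypothetical building blocks used only for this illustration.
def normalise_whitespace(text: str) -> str:
    """Collapse line breaks and runs of spaces into single spaces."""
    return " ".join(text.split())


def dehyphenate(text: str) -> str:
    """Rejoin words that were hyphenated across a line break."""
    return text.replace("- ", "")


print(run_workflow([normalise_whitespace, dehyphenate], "digiti-\nsation   of  books"))
# prints "digitisation of books"
```

Because every module has the same shape, modules can be reordered or swapped without touching the code that runs the pipeline.
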
  9. Workflow management. Web 2.0 style registry: myExperiment; local client: Taverna Workbench; web client: project website.
  10. Local client: Taverna Workbench. Background: biosciences; developed and maintained by myGrid, UK; available for Windows/Linux/OSX and as open source (Java).
  11. Web client: Taverna Server / Workflow Parser. SOAP/REST API; remote execution of workflows.
  12. Community. Web 2.0 style workflow registry; community of experts; sharing of resources; knowledge exchange; a central meeting point for users and researchers.
  13. Compute cluster. An Enterprise Service Bus receives requests from users and distributes the load to the available worker nodes. Main effects: process parallelization, load distribution, failover. Test deployment on the Dutch Supercomputing Cloud HPC.
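
Load distribution with failover, as described on slide 13, might look roughly like the following sketch. It is a toy round-robin dispatcher, not the Apache Synapse configuration used in the project; the class name `Dispatcher` and the example workers are hypothetical, and requests are handled one at a time, so the parallelization aspect is not shown.

```python
from typing import Callable, List, Optional

# A worker node is modelled as a callable that processes one request.
Worker = Callable[[str], str]


class Dispatcher:
    """Hand requests to worker nodes in turn, skipping nodes that fail."""

    def __init__(self, workers: List[Worker]) -> None:
        if not workers:
            raise ValueError("at least one worker node is required")
        self._workers = workers
        self._next = 0  # index of the node that gets the next request

    def submit(self, request: str) -> str:
        last_error: Optional[Exception] = None
        for attempt in range(len(self._workers)):
            index = (self._next + attempt) % len(self._workers)
            try:
                result = self._workers[index](request)
            except Exception as error:  # failover: try the next node
                last_error = error
                continue
            # Continue the rotation after the node that just succeeded.
            self._next = (index + 1) % len(self._workers)
            return result
        raise RuntimeError("all worker nodes failed") from last_error


# Hypothetical worker nodes for the demonstration.
def node_a(request: str) -> str:
    return "a:" + request


def broken_node(request: str) -> str:
    raise ConnectionError("node unavailable")


def node_b(request: str) -> str:
    return "b:" + request


dispatcher = Dispatcher([node_a, broken_node, node_b])
print([dispatcher.submit(page) for page in ("p1", "p2", "p3")])
# prints ['a:p1', 'b:p2', 'a:p3']
```
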
  14. Dataset. Representative and annotated dataset of significant size, with metadata, ground truth and search facilities.
  15. Evaluation features. Text-based comparison of results with ground truth, using the Levenshtein distance; layout-based comparison of results with ground truth, using the Page Analysis and Ground Truth Elements (PAGE) framework.
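
The text-based comparison on slide 15 relies on the Levenshtein distance, which can be computed as sketched below. The function `character_accuracy` is an assumption added for illustration; the slides do not say how the project turns the distance into a score.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a matching character
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr_text: str, ground_truth: str) -> float:
    """Hypothetical score: 1 minus the edit distance per ground truth character."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    return max(0.0, 1.0 - levenshtein(ocr_text, ground_truth) / len(ground_truth))


print(levenshtein("kitten", "sitting"))  # prints 3
print(levenshtein("Tbe quick", "The quick"))  # prints 1
```

Keeping only two rows of the distance table keeps memory use proportional to the length of the ground truth, which matters when whole pages are compared.
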
  16. Outlook. Online service for testing/evaluation/processing; results repository (WebDAV, POI). Extending the scope: workflows for linguistic analysis (CLARIN); workflows for preservation (SCAPE). Even better scalability: MapReduce/Hadoop. Supported by a community of developers and practitioners.
  17. Summary. Availability of resources (images, ground truth and tools) to the international research community; a common baseline for transparent evaluation and comparison; ready-to-use components and reproducible experiments; sharing of results and know-how; scalability for prototypes and data-intensive workflows; a simple and uniform user interface for all embedded tools; consolidation of support and maintenance. Thank you! Questions?
