IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.




IMPACT Interoperability and Evaluation Framework
Clemens Neudecker, National Library of the Netherlands
IMPACT Demo Day, Biblioteca Nacional de España




OCR: A multitude of challenges…
I. OCR challenges (gothic fonts, bleed-through, warping, etc.)




OCR: A multitude of challenges…
II. Language challenges (spelling variants, inflection, and many more!)




Example: historical variants of the Dutch word ‘wereld’ (world):
werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt
wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels
zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts
werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts
werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
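
To illustrate why these variants matter for full-text search, here is a minimal sketch of lexicon-based normalization. It is not the IMPACT lexicon tooling: the names `VARIANT_TO_LEMMA`, `normalize` and `search` are hypothetical, the lexicon holds only a handful of the variants listed above, and the sample sentence is made up.

```python
# Hypothetical mini-lexicon: a few historical spellings from the list above,
# all mapped to the modern lemma. A real lexicon would be far larger.
VARIANT_TO_LEMMA = {
    variant: "wereld"
    for variant in ("werelt", "weerelt", "wereld", "waereld", "weereld",
                    "vvereld", "werreld", "wareld", "werld", "weireld")
}


def normalize(token):
    """Return the modern lemma for a known variant, else the token unchanged."""
    return VARIANT_TO_LEMMA.get(token.lower(), token)


def search(query, text):
    """Return positions of tokens whose normalized form matches the query."""
    target = normalize(query)
    return [i for i, tok in enumerate(text.split()) if normalize(tok) == target]


# Made-up sample text containing two historical spellings.
hits = search("wereld", "De gantsche Waereld en de nieuwe werelt")
print(hits)
```

An exact-match search for 'wereld' would find nothing in the sample text; with normalization both historical spellings are found.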




And a multitude of solutions!
        22 different ‘tools’ from diverse developers:
        OCR (C++, C#),
        Image Processing & Lexica (DLL),
        Command Line Tools (Win/Linux),
        Java, Ruby, PHP, Perl, etc.
        + 3rd party software!

        “One ring to rule them all...”

→ IMPACT Interoperability Framework (IIF)




Main requirements
Behavioural:
  Minimize integration effort
  Minimize deployment effort
  Maximize usability
  Maximize scalability

Functional:
  Modular
  Transparent
  Expandable
  Open source
  Platform independent




Architecture
        IMPACT Interoperability Framework: Technologies
        - Java 6
        - Generic Web Service Wrapper
        - Apache Ant/Maven
        - Apache Tomcat/httpd
        - Apache Axis2
        - Apache Synapse
        - Taverna Workflow Engine

        IMPACT Evaluation Framework: Dataset
        - approx. 5 TB raw data (images, text files, metadata) and growing
        - Ground truth transcriptions
        - Evaluation modules




Components I: IIF

        The Enterprise Service Bus receives (SOAP) requests from users
        and distributes the load to the available worker nodes.

        Main effects:
        Process parallelization,
        Load distribution,
        Fail-over
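
The load distribution and fail-over idea can be sketched as follows. This is a toy stand-in, not Apache Synapse and not the IIF configuration: the `Dispatcher` class, the worker functions and the node names are all hypothetical, and workers are plain Python callables rather than SOAP endpoints. Parallel execution is left out to keep the sketch short.

```python
from itertools import cycle


class Dispatcher:
    """Toy stand-in for the service bus: round-robin dispatch with fail-over."""

    def __init__(self, workers):
        self.workers = list(workers)
        self._order = cycle(range(len(self.workers)))

    def dispatch(self, request):
        # Try each worker at most once, starting with the next one in rotation.
        errors = []
        for _ in range(len(self.workers)):
            worker = self.workers[next(self._order)]
            try:
                return worker(request)
            except ConnectionError as exc:
                errors.append(exc)  # fail over to the next worker
        raise RuntimeError(f"all {len(self.workers)} workers failed: {errors}")


def healthy(name):
    """Build a hypothetical worker node that always succeeds."""
    return lambda request: f"{name} processed {request}"


def broken(request):
    """Hypothetical worker node that is unreachable."""
    raise ConnectionError("node down")


bus = Dispatcher([healthy("node-1"), broken, healthy("node-3")])
results = [bus.dispatch(f"page-{i}") for i in range(3)]
print(results)
```

The second request is first routed to the broken node and then transparently retried on the next one, so the caller never sees the failure.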




Framework integration
        Easy-to-use generic command line wrapper (open source)
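
As a rough idea of what wrapping a command line tool involves, here is a minimal sketch using Python's standard `subprocess` module. It is not the IMPACT wrapper (which exposes tools as web services): the function `run_tool` and the stand-in "tool" are assumptions made for illustration.

```python
import subprocess
import sys


def run_tool(command, input_text="", timeout=60):
    """Run a command line tool and return its result as a plain dict."""
    completed = subprocess.run(
        command,
        input=input_text,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return {
        "exit_code": completed.returncode,
        "stdout": completed.stdout,
        "stderr": completed.stderr,
    }


# Stand-in "tool": a Python one-liner that upper-cases its standard input.
tool = [sys.executable, "-c",
        "import sys; sys.stdout.write(sys.stdin.read().upper())"]
result = run_tool(tool, input_text="wereld")
print(result["exit_code"], result["stdout"])
```

Returning exit code, standard output and standard error in one structure is what lets a service layer treat very different tools uniformly.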




Workflow development

        OCR workflow = data pipeline
        Building blocks = processing steps (nodes)
        Integration = interaction between nodes (mashup)
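
The "workflow = data pipeline" idea can be sketched in a few lines. This is not Taverna and not an actual IMPACT workflow: the step names (`binarize`, `segment`, `recognize`), the helper functions and the file name are hypothetical, and each node only records that it ran.

```python
from functools import reduce


def step(name):
    """Build a hypothetical processing node that records its name on the page."""
    return lambda page: {**page, "steps": page["steps"] + [name]}


def make_workflow(*nodes):
    """Chain nodes so that each node's output is the next node's input."""
    return lambda page: reduce(lambda data, node: node(data), nodes, page)


# Hypothetical three-node OCR workflow.
ocr_workflow = make_workflow(step("binarize"), step("segment"), step("recognize"))
result = ocr_workflow({"image": "page-001.tif", "steps": []})
print(result["steps"])
```

Because every node takes a page and returns a page, nodes can be swapped or reordered without touching the others.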




Workflow management
       Web 2.0 style registry: myExperiment
       Local client: Taverna Workbench
       Web client: project website
       API: SOAP/REST




Community
        Web 2.0 style workflow registry

        Community of experts

        Sharing of resources

        Knowledge exchange

        A central meeting point
        for users and researchers




Components II: Dataset
Database and front end, hosted by the PRIMA
research group, School of Computing,
University of Salford, United Kingdom

- more than 500,000 images from digital libraries
- more than 50,000 ground truth representations
- up to 10,000 direct access calls per month
- 4 TB of space and growing




Dataset
        Access to a representative and annotated dataset of significant size,
        with metadata, ground truth and search facilities




Evaluation features
        Text-based comparison of the result with ground truth,
        using the Levenshtein distance
        Layout-based comparison of the result with ground truth,
        using the Page Analysis and Ground Truth Elements (PAGE) framework
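
For the text-based comparison, here is a minimal sketch of the Levenshtein distance and a character accuracy derived from it. It is not the IMPACT evaluation module: the function names are hypothetical, and defining accuracy as 1 minus distance divided by ground truth length is an assumption made for this illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr_text, ground_truth):
    """Assumed measure: 1 - distance / ground truth length, floored at 0."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = levenshtein(ocr_text, ground_truth)
    return max(0.0, 1.0 - errors / len(ground_truth))


distance = levenshtein("vvereld", "wereld")
accuracy = character_accuracy("vvereld", "wereld")
print(distance, round(accuracy, 3))
```

Here the OCR output 'vvereld' needs one deletion and one substitution to match the ground truth 'wereld', giving a distance of 2.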




Ground-Truthing Tools

 Aletheia

 FineReader
 PAGE Exporter

 GT Validator

 GT Normalizer








Measures – Segmentation Errors
[Figure: segmentation error types (Miss, Partial Miss, Misclassification, Merge, Split), illustrated on Caption and Paragraph regions. Legend: Ground Truth, Segmentation Result.]
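
How three of these error types could be detected is sketched below. This is not the PAGE evaluation code: regions are simplified to axis-aligned rectangles `(x0, y0, x1, y1)`, the readings of "miss", "split" and "merge" used here are assumptions based on how the terms are commonly understood, and all region names and coordinates are made up.

```python
def overlaps(a, b):
    """True if rectangles (x0, y0, x1, y1) share a non-empty area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def segmentation_errors(ground_truth, result):
    """Report miss, split and merge errors under the assumed definitions."""
    errors = []
    for name, gt_box in ground_truth.items():
        hits = [box for box in result.values() if overlaps(gt_box, box)]
        if not hits:
            errors.append(("miss", name))    # no result region covers it
        elif len(hits) > 1:
            errors.append(("split", name))   # covered by several result regions
    for name, res_box in result.items():
        covered = [g for g, box in ground_truth.items() if overlaps(res_box, box)]
        if len(covered) > 1:
            errors.append(("merge", name))   # one result region spans several
    return errors


# Made-up page layout: three ground truth regions, two detected regions.
ground_truth = {
    "caption": (0, 0, 100, 20),
    "paragraph": (0, 30, 100, 90),
    "footer": (0, 95, 100, 100),
}
result = {"r1": (0, 0, 100, 60), "r2": (0, 60, 100, 90)}
errors = segmentation_errors(ground_truth, result)
print(errors)
```

Partial misses and misclassifications would additionally need overlap areas and region type labels, which this sketch leaves out.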




OCR Accuracy




                                       Thank you! Questions?
