IMPACT Interoperability and Evaluation Framework
Clemens Neudecker, National Library of the Netherlands
IMPACT Demo Day, Biblioteca Nacional de España
OCR: A multitude of challenges… I. OCR challenges (gothic fonts, bleed-through, warping, etc.)
OCR: A multitude of challenges… II. Language challenges (spelling variants, inflection, and many more!) Example: historical variants of the Dutch word ‘wereld’ (world): werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald we ë led
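To make the spelling-variant problem concrete, here is a minimal sketch of a lexicon lookup that maps historical variants to a modern form. The variants are a small subset of those listed above; the function names `build_lexicon` and `normalise` are hypothetical and are not part of any IMPACT tool.

```python
from typing import Dict, Iterable, List

# A few of the historical variants of 'wereld' listed on the slide.
VARIANTS_OF_WERELD = {"werelt", "weerelt", "waereld", "vvereld", "werlt"}


def build_lexicon(modern_form: str, variants: Iterable[str]) -> Dict[str, str]:
    """Map each variant (lower-cased) to the modern form (hypothetical helper)."""
    lexicon = {variant.lower(): modern_form for variant in variants}
    lexicon[modern_form.lower()] = modern_form
    return lexicon


def normalise(tokens: List[str], lexicon: Dict[str, str]) -> List[str]:
    """Replace known variants by their modern form; leave other tokens untouched."""
    return [lexicon.get(token.lower(), token) for token in tokens]
```

A plain dictionary only covers variants that are already known; unseen spellings would need approximate matching, which this sketch does not attempt.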
And a multitude of solutions!
22 different ‘tools’ from diverse developers:
- OCR (C++, C#)
- Image Processing & Lexica (DLL)
- Command Line Tools (Win/Linux)
- Java, Ruby, PHP, Perl, etc.
- plus 3rd-party software!
“One ring to rule them all...” -> IMPACT Interoperability Framework (IIF)
Main requirements
Behavioural:
- Minimize integration effort
- Minimize deployment effort
- Maximize usability
- Maximize scalability
Functional:
- Modular
- Transparent
- Expandable
- Open source
- Platform independent
Architecture
IMPACT Interoperability Framework: Technologies
- Java 6
- Generic Web Service Wrapper
- Apache Ant/Maven
- Apache Tomcat/httpd
- Apache Axis2
- Apache Synapse
- Taverna Workflow Engine
IMPACT Evaluation Framework: Dataset
- approx. 5 TB raw data (images, text files, metadata) and growing
- Ground truth transcriptions
- Evaluation modules
Components I: IIF
Enterprise Service Bus receives (SOAP) requests from users and distributes the load to the available worker nodes
Main effect: process parallelization, load distribution, failover
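The load distribution and failover behaviour described for the service bus can be illustrated with a small round-robin dispatcher. This is only a sketch under assumptions: workers are modelled as plain Python callables, and `NodeDown` and `Dispatcher` are hypothetical names. It does not reproduce how Apache Synapse is configured in the IIF.

```python
from typing import Callable, List


class NodeDown(Exception):
    """Raised by a worker node that cannot take the request (hypothetical)."""


class Dispatcher:
    """Round-robin dispatch over worker nodes, skipping nodes that are down."""

    def __init__(self, workers: List[Callable[[str], str]]) -> None:
        self._workers = list(workers)
        self._next = 0  # index of the worker that gets the next request

    def submit(self, request: str) -> str:
        count = len(self._workers)
        for attempt in range(count):
            index = (self._next + attempt) % count
            try:
                result = self._workers[index](request)
            except NodeDown:
                continue  # failover: try the next node in the ring
            # Start the next request at the node after the one that answered.
            self._next = (index + 1) % count
            return result
        raise RuntimeError("no worker node available")
```

Each request starts at the node after the last one that answered, so healthy nodes share the load and a failed node is simply skipped.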
Framework integration
Easy-to-use generic command line wrapper (open source)
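As a rough idea of what a generic command line wrapper does, the sketch below fills a command template with parameters, runs the tool, and returns its output in a structured form that a service layer could pass on. The function `run_tool` and the template syntax are assumptions made for illustration; the actual IIF wrapper is a Java-based web service wrapper and is not shown here.

```python
import subprocess
from typing import Dict, List


def run_tool(command_template: List[str], params: Dict[str, str],
             timeout: float = 60.0) -> Dict[str, object]:
    """Fill '{name}' placeholders in the template, run the tool, collect output.

    Hypothetical helper: the template format is an assumption, not the IIF's.
    """
    command = [part.format(**params) for part in command_template]
    completed = subprocess.run(
        command,
        capture_output=True,  # keep stdout/stderr instead of printing them
        text=True,            # decode bytes to str
        timeout=timeout,
    )
    return {
        "command": command,
        "returncode": completed.returncode,
        "stdout": completed.stdout,
        "stderr": completed.stderr,
    }
```

Passing the command as a list (rather than a shell string) avoids shell quoting issues when parameters such as file names contain spaces.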
Workflow development
- OCR workflow = data pipeline
- Building blocks = processing steps (nodes)
- Integration = interaction between nodes (mashup)
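The "workflow = data pipeline" idea can be sketched as function composition: each node is a processing step, and the workflow feeds the output of one node into the next. The helper `build_workflow` is hypothetical, and nodes are reduced to string-to-string functions here; in the IIF the nodes are web services orchestrated by Taverna.

```python
from functools import reduce
from typing import Callable, Sequence

Step = Callable[[str], str]  # one processing step (node) in the pipeline


def build_workflow(steps: Sequence[Step]) -> Step:
    """Chain processing steps so each receives the previous step's output."""
    def run(data: str) -> str:
        return reduce(lambda value, step: step(value), steps, data)
    return run
```

For example, `build_workflow([str.strip, str.lower])` yields a single callable that first trims and then lower-cases its input; swapping or adding a node only changes the list of steps.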
Workflow management
- Web 2.0 style registry: myExperiment
- Local client: Taverna Workbench
- Web client: project website
- API: SOAP/REST
Community
- Web 2.0 style workflow registry
- Community of experts
- Sharing of resources
- Knowledge exchange
A central meeting point for users and researchers
Components II: Dataset
Database and front end, hosted at the PRIMA research group, School of Computing, University of Salford, United Kingdom
- more than 500,000 images from digital libraries
- more than 50,000 ground truth representations
- up to 10,000 direct access calls per month
- 4 TB of space and growing
Dataset Access to a representative and annotated dataset of significant size, with metadata, ground truth and search facilities
Evaluation features
- Text-based comparison of result with ground truth, using the Levenshtein distance method
- Layout-based comparison of result with ground truth, using the Page Analysis and Ground Truth Elements (PAGE) Framework
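The text-based comparison rests on the Levenshtein (edit) distance between the OCR result and the ground truth. Below is a minimal sketch of the standard dynamic-programming algorithm. The `character_accuracy` function uses a common convention (one minus edit distance divided by ground-truth length), which is an assumption here and not necessarily the formula used by the IMPACT evaluation modules.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(result: str, ground_truth: str) -> float:
    """Assumed convention: 1 - distance / len(ground_truth), floored at 0."""
    if not ground_truth:
        return 1.0 if not result else 0.0
    errors = levenshtein(result, ground_truth)
    return max(0.0, 1.0 - errors / len(ground_truth))
```

With this convention, recognising `werelt` where the ground truth reads `wereld` counts as one substitution in six characters.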
The PAGE Format Framework
- Two-level architecture: root structure, task-specific sub-formats
- Separate XML Schema definitions
- Format identification via namespaces
- Mapping of dependencies: process chains, alternative processing steps
- Linking via IDs
- Processing results or ground truth (e.g. binarisation, dewarping, page content)
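Format identification via namespaces can be sketched as follows: the namespace of the XML root element decides which sub-format a file belongs to. The namespace URIs and the `identify_format` helper below are placeholders invented for illustration; they are not the namespaces defined by the PAGE schemas.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumptions, NOT the real PAGE namespaces).
KNOWN_FORMATS = {
    "urn:example:page:content": "page content",
    "urn:example:page:dewarping": "dewarping",
}


def identify_format(xml_text: str) -> str:
    """Return the sub-format name based on the root element's namespace."""
    root = ET.fromstring(xml_text)
    # ElementTree writes namespaced tags as '{namespace}localname'.
    if root.tag.startswith("{"):
        namespace = root.tag[1:].split("}", 1)[0]
    else:
        namespace = ""
    return KNOWN_FORMATS.get(namespace, "unknown")
```

Because each sub-format has its own schema and namespace, a tool can reject or route a file before parsing any of its content.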
Ground-Truthing Tools
- Aletheia
- FineReader PAGE Exporter
- GT Validator
- GT Normalizer
Profile ‘Full Text Recognition’
Evaluation for general text recognition
Measure weights: Merge, Allowable Merge, Split, Allowable Split, Miss, Partial Miss, Misclassification, False Detection
Region type weights: Text, Image, Graphic, Chart, Table, Separator, Maths, Noise
Weight values: 1.5 1.0 2.0 2.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5 0.5
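One way to read an evaluation profile is as two weight tables, one per error measure and one per region type, that scale each detected segmentation error. The sketch below shows that idea only. The formula (measure weight times region weight times affected area) and the weights used in any example call are assumptions for illustration, not the definition used by the IMPACT evaluation tools or the values of this profile.

```python
from typing import Dict, Iterable, Tuple

# One error: (error measure, region type, affected area in pixels)
Error = Tuple[str, str, float]


def weighted_error(errors: Iterable[Error],
                   measure_weights: Dict[str, float],
                   region_weights: Dict[str, float]) -> float:
    """Sum of area * measure weight * region weight (assumed formula)."""
    total = 0.0
    for measure, region_type, area in errors:
        # Unknown measures or region types contribute nothing.
        weight = measure_weights.get(measure, 0.0) * region_weights.get(region_type, 0.0)
        total += weight * area
    return total
```

Setting a region type's weight to 0.0 makes the profile ignore errors on that type entirely, which is how a text-focused profile can disregard, say, graphics.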
Measures – Segmentation Errors
- Merge
- Split
- Miss
- Partial Miss
- Misclassification
Figure: ground truth compared with segmentation result (example regions: paragraph, caption)
OCR Accuracy
Thank you! Questions?
