BL Demo Day - July 2011 - (9) IMPACT Interoperability and Evaluation Framework
 

Slides from Clemens Neudecker's presentation on the IMPACT Interoperability and Evaluation Framework, given within the IMPACT project at the British Library Demo Day on 12 July 2011.

Usage Rights

CC Attribution-NonCommercial-NoDerivs License

    Presentation Transcript

    • IMPACT Interoperability and Evaluation Framework. Clemens Neudecker, National Library of the Netherlands. IMPACT Demo Day, British Library, 12 July 2011
    • OCR: A multitude of challenges…
      • I. OCR challenges (gothic fonts, bleed-through, warping, etc.)
    • OCR: A multitude of challenges…
      • II. Language challenges (spelling variants, inflection, and many more!)
      Example: historical variants of the Dutch word ‘wereld’ (world): werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
    • And a multitude of solutions!
      • 22 different ‘tools’ from diverse developers:
      • OCR (C++, C#),
      • Image Processing & Lexica (DLL),
      • Command Line Tools (Win/Linux),
      • Java, Ruby, PHP, Perl, etc.
      • + 3rd party software!
      • “One ring to rule them all...”
      • → IMPACT Interoperability Framework (IIF)
    • Main requirements
      • Behavioural:
        • Minimize integration effort
        • Minimize deployment effort
        • Maximize usability
        • Maximize scalability
      • Functional:
        • Modular
        • Transparent
        • Expandable
        • Open source
        • Platform independent
    • Architecture
      • IMPACT Interoperability Framework: Technologies
        • Java 6
        • Generic Web Service Wrapper
        • Apache Ant/Maven
        • Apache Tomcat/httpd
        • Apache Axis2
        • Apache Synapse
        • Taverna Workflow Engine
      • IMPACT Evaluation Framework: Dataset
        • approx. 5 TB raw data (images, text files, metadata) and growing
        • Ground truth transcriptions
        • Evaluation modules
    • Components I: IIF
      • Enterprise Service Bus receives (SOAP) requests from users and distributes the load to the available worker nodes
      • Main effects: process parallelization, load distribution, failover
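
A request to the bus is just a SOAP message posted to a service endpoint; the ESB then routes it to a free worker node. The following minimal sketch illustrates that pattern in Python. The endpoint URL, XML namespace, and operation/field names are invented for the example and are not the actual IMPACT service contract.

```python
# Minimal sketch: submitting an OCR job to an ESB endpoint as a SOAP request.
# Endpoint, namespace, and operation names below are hypothetical.
import urllib.request

ENDPOINT = "http://esb.example.org/services/OcrService"

envelope = """<?xml version="1.0" encoding="UTF-8"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
                  xmlns:ocr="http://example.org/ocr">
  <soapenv:Body>
    <ocr:process>
      <ocr:imageUrl>http://example.org/scans/page001.tif</ocr:imageUrl>
    </ocr:process>
  </soapenv:Body>
</soapenv:Envelope>"""

request = urllib.request.Request(
    ENDPOINT,
    data=envelope.encode("utf-8"),
    headers={"Content-Type": "text/xml; charset=utf-8", "SOAPAction": "process"},
)
with urllib.request.urlopen(request) as response:
    print(response.read().decode("utf-8"))  # response produced by a worker node
```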
    • Framework integration
      • Easy-to-use generic command-line wrapper (open source)
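
In essence the wrapper automates this pattern: map an incoming service request onto a tool invocation, run the executable, and return the result. A minimal sketch, assuming a hypothetical `ocr-engine` binary with `--input`/`--output` flags:

```python
# Sketch of the pattern a generic command-line wrapper automates.
# The tool name and its flags are hypothetical placeholders.
import subprocess

def run_tool(input_path: str, output_path: str, tool: str = "ocr-engine") -> None:
    """Invoke a command-line tool and raise if it reports an error."""
    result = subprocess.run(
        [tool, "--input", input_path, "--output", output_path],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(f"{tool} failed: {result.stderr.strip()}")

# Example: run_tool("page001.tif", "page001.xml")
```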
    • Workflow development
      • OCR workflow = data pipeline
      • Building blocks = processing steps (nodes)
      • Integration = interaction between nodes (mashup)
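
In code terms, this pipeline view amounts to function composition: each node maps one intermediate result to the next. A toy sketch, with invented step names and file-name conventions (real IMPACT workflows chain remote services in Taverna):

```python
# Toy illustration of a workflow as composed processing nodes.
from functools import reduce
from typing import Callable

Step = Callable[[str], str]

def binarise(path: str) -> str:               # placeholder nodes that just
    return path.replace(".tif", ".bin.tif")   # rename; real nodes do work

def dewarp(path: str) -> str:
    return path.replace(".bin.tif", ".dewarped.tif")

def ocr(path: str) -> str:
    return path.replace(".dewarped.tif", ".xml")

def pipeline(*steps: Step) -> Step:
    """Compose processing nodes into a single workflow."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)

workflow = pipeline(binarise, dewarp, ocr)
print(workflow("page001.tif"))  # -> page001.xml
```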
    • Workflow management
      • Web 2.0 style registry: myExperiment
      • Local client: Taverna Workbench
      • Web client: project website
      • API: SOAP/REST
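
As one example of the API route, myExperiment publishes workflow listings over a REST interface that returns XML. The endpoint path and response layout below are assumptions based on myExperiment's XML API, so treat this as a sketch rather than a verified call:

```python
# Hedged sketch: querying a workflow registry's REST API.
import urllib.request
import xml.etree.ElementTree as ET

url = "https://www.myexperiment.org/workflows.xml?tag=ocr"
with urllib.request.urlopen(url) as response:
    root = ET.fromstring(response.read())

for workflow in root.iter("workflow"):
    # Assumed layout: each entry carries a URI attribute and a title as text.
    print(workflow.get("uri"), (workflow.text or "").strip())
```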
    • Community
      • Web2.0 style workflow registry
      • Community of experts
      • Sharing of resources
      • Knowledge exchange
      • A central meeting point for users and researchers
    • Components II: Dataset
      • Database and front end, hosted at the PRIMA research group, School of Computing, University of Salford, United Kingdom
        • more than 500,000 images from Digital Libraries
        • more than 50,000 ground truth representations
        • up to 10,000 direct access calls per month
        • 4 TB of space and growing
    • Dataset
      • Access to a representative and annotated dataset of significant size, with metadata, ground truth and search facilities
    • Evaluation features
      • Text-based comparison of the result with ground truth, using the Levenshtein distance method
      • Layout-based comparison of the result with ground truth, using the Page Analysis and Ground Truth Elements (PAGE) framework
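
To make the text-based measure concrete: the Levenshtein distance is the minimum number of character insertions, deletions, and substitutions needed to turn the OCR result into the ground truth; dividing by the ground-truth length gives an error rate. A minimal self-contained sketch (the accuracy formula here is the common definition, not necessarily the evaluation module's exact one):

```python
# Levenshtein distance between OCR output and ground truth, plus the
# derived character accuracy. Standard dynamic-programming formulation.
def levenshtein(a: str, b: str) -> int:
    """Minimum number of edits (insert/delete/substitute) turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def character_accuracy(result: str, ground_truth: str) -> float:
    return 1.0 - levenshtein(result, ground_truth) / max(len(ground_truth), 1)

# Historical 'vv' recognised instead of 'w': two edits, accuracy ~0.67
print(character_accuracy("vverelt", "werelt"))
```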
    • The PAGE Format Framework
      • Two-level architecture:
        • root structure
        • task specific sub-formats
      • Separate XML Schema definitions
      • Format identification via Namespaces
      • Mapping of
        • dependencies
        • process chains
        • alternative processing steps
      • Linking via IDs
      Processing results or ground truth (e.g. binarisation, dewarping, page content)
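
To make the two-level idea concrete, here is a hedged sketch of a skeletal PAGE page-content document built in Python: the sub-format is identified by its XML namespace, and regions carry IDs that other results or processing steps can link to. The namespace URI and element subset are simplified for illustration; the published PAGE XML Schemas are authoritative.

```python
# Skeletal PAGE-style document: the namespace identifies the sub-format,
# IDs make regions linkable. Namespace and elements simplified here.
import xml.etree.ElementTree as ET

NS = "http://schemas.primaresearch.org/PAGE/gts/pagecontent/2010-03-19"
ET.register_namespace("", NS)

pcgts = ET.Element(f"{{{NS}}}PcGts")
page = ET.SubElement(pcgts, f"{{{NS}}}Page", imageFilename="page001.tif",
                     imageWidth="2480", imageHeight="3508")
region = ET.SubElement(page, f"{{{NS}}}TextRegion", id="r1")  # link target
ET.SubElement(region, f"{{{NS}}}Coords",
              points="100,100 1200,100 1200,900 100,900")

print(ET.tostring(pcgts, encoding="unicode"))
```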
    • Ground-Truthing Tools
      • Aletheia
      • FineReader
      • PAGE Exporter
      • GT Validator
      • GT Normalizer
    • Profile ‘Full Text Recognition’
      • Evaluation for general text recognition
      [Table: evaluation weights for the ‘Full Text Recognition’ profile, pairing measure weights (Merge, Allowable Merge, Split, Allowable Split, Miss, Partial Miss, Misclassification, False Detection) with region type weights (Text, Image, Graphic, Chart, Table, Separator, Maths, Noise); weight values range from 0.0 to 2.0]
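
How such a profile is applied can be sketched as follows: each detected error contributes a cost equal to its measure weight scaled by the weight of the affected region type, so for full text recognition errors on text regions dominate the score while, say, noise regions can be discounted. The weight values in this sketch are placeholders, not the profile's actual numbers:

```python
# Hedged sketch of profile-based scoring. All weight values are placeholders.
MEASURE_WEIGHTS = {"merge": 1.0, "split": 1.0, "miss": 2.0,
                   "partial-miss": 1.0, "misclassification": 0.5,
                   "false-detection": 0.5}
REGION_WEIGHTS = {"text": 1.0, "image": 0.0, "separator": 0.0, "noise": 0.0}

def weighted_error(errors):
    """errors: iterable of (measure, region_type) pairs detected on a page."""
    return sum(MEASURE_WEIGHTS.get(m, 0.0) * REGION_WEIGHTS.get(r, 0.0)
               for m, r in errors)

print(weighted_error([("miss", "text"), ("split", "image")]))  # -> 2.0
```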
    • Measures – Segmentation Errors: [diagram comparing a ground truth page with a segmentation result, illustrating Merge, Split, Miss, Partial Miss, and Misclassification errors on Paragraph and Caption regions]
    • OCR Accuracy
    • Thank you! Questions?