BL Demo Day - July 2011 - (5) OCR for IMPACT Part 2


Niall Anderson outlines the IMPACT approach to adaptive OCR and post-production, including tools prepared by IBM (CONCERT) and experimental tools from USAL, NCSR and UIBK.

Delivered at BL Demo Day - 12th July 2011

Published in: Technology, Education


  1. OCR and post-correction
  2. Recap
       • OCR will produce its best results on material with the following characteristics:
           • The layout of the text is simple, with no tables or illustrations
           • The text itself is in a modern, computer-generated typeface
           • The digital image preserves a high contrast between the text block and non-text detail (including blank space)
           • The image has been created from a perfectly flat and straight scan (if a digital copy from an analogue source)
           • The text of the analogue source is clear, well aligned and consistently presented
           • The basic material of the analogue source is undamaged; the text is in a single language
           • The image has been taken from the original physical source and not a degraded surrogate (such as microfilm)
  3. Context for Post-Correction
       • CENL 2008 survey of its members:
           • 350% increase in digitisation of historic texts in five years
           • 8 million digital items in existence
           • Era of mass digitisation of historical texts
       • British Library experience:
           • OCR tools are good but not good enough
           • On average, more than 20% of words are lost
           • Loss of letters, words, and “significant” words
  4. Common characteristics of digital text images …
  5. … and their effects on OCR
  6. Collaborative correction: some prior and ongoing systems
  7. ManyHands.pdf
  8. The IMPACT view:
       • “Mass digitisation demands the creation of a new digitisation paradigm by mobilising the general public to help with large-scale digitisation efforts. Because state-of-the-art systems to enable public participation are limited, we intend to resolve this difficulty by adopting advanced tools that will facilitate volunteer participation.”
  9. The IMPACT approach:
       • CONCERT (Cooperative Engine for the Correction of Extracted Text): a data validation/correction application that is simple and intuitive enough to be attractive to untrained users and yet effective enough to ensure high productivity
       • Featuring automatic data management that allows for the combination of results from several users at once
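
The slide does not say how CONCERT combines results from several users. As a minimal sketch of one common option, the Python below accepts a correction only when a strict majority of users agree on it. The function name, the token ids and the `min_agreement` parameter are assumptions made for illustration and are not part of CONCERT.

```python
from collections import Counter


def combine_corrections(submissions, min_agreement=2):
    """Merge corrections keyed in by several users (illustrative only).

    submissions maps a hypothetical token id to the list of strings that
    different users typed for that token. A token resolves to the most
    common answer only if at least `min_agreement` users gave it and it
    is a strict majority; otherwise it resolves to None (needs review).
    """
    combined = {}
    for token_id, answers in submissions.items():
        if not answers:
            combined[token_id] = None
            continue
        top, count = Counter(answers).most_common(1)[0]
        agreed = count >= min_agreement and count > len(answers) / 2
        combined[token_id] = top if agreed else None
    return combined


if __name__ == "__main__":
    print(combine_corrections({"w1": ["house", "house", "hou5e"],
                               "w2": ["the", "tbe"]}))
```

Tokens that resolve to `None` would be candidates for another round of validation rather than being silently accepted.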
  10. System architecture
       • Secure login
       • Upload of books/volumes as image files or by URL
       • Omni-OCR with language selection
       • Download of compiled OCR metadata before or after key-in
  11. System workflow:
       • Three stages:
           • Character (carpet) session – for fast validation of OCR results
           • Word session – in cases where contextual information is needed to validate characters
           • Page-level session – in cases where full page view is needed to interpret results
  12. Character session
       • Analysis of OCR results:
           • High confidence results do not require verification
           • However, some high confidence results may be misrecognitions
           • Individual character images are extracted and grouped together based on recognition results
           • User selects and submits suspicious characters
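
The grouping step of the character session can be sketched as follows. This is an illustration under stated assumptions, not CONCERT's implementation: the `CharResult` record, its field names, and the choice to sort each group with the lowest confidence first are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CharResult:
    char: str          # character the OCR engine recognised
    confidence: float  # engine confidence, assumed to lie in [0, 1]
    image_id: str      # hypothetical reference to the cropped character image


def build_carpets(results):
    """Group character images by recognised character (one "carpet" each).

    Showing every image recognised as, say, "e" side by side lets a user
    spot the odd one out, even where the engine reported high confidence.
    """
    carpets = defaultdict(list)
    for result in results:
        carpets[result.char].append(result)
    # Assumption: put the least confident images first in each carpet.
    for group in carpets.values():
        group.sort(key=lambda r: r.confidence)
    return dict(carpets)


if __name__ == "__main__":
    sample = [CharResult("e", 0.99, "img-1"),
              CharResult("c", 0.95, "img-2"),
              CharResult("e", 0.60, "img-3")]
    for char, group in build_carpets(sample).items():
        print(char, [r.image_id for r in group])
```

The image ids a user selects as suspicious in a carpet would then be passed on to the word session.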
  13. Word session
       • Shows words that contain low confidence characters
       • Shows words that contain characters identified as suspicious in the Character Session
       • Shows original OCR recognition results and possible spelling options
       • Users can validate the appropriate spelling option or provide their own spelling
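
A rough sketch of how words might be selected for the word session and offered spelling options. The data layout, the 0.8 confidence threshold, and the use of Python's `difflib` for suggestions are assumptions for illustration; the slides do not describe how CONCERT generates its spelling options.

```python
import difflib


def words_for_review(words, suspicious_ids, threshold=0.8):
    """Pick words with a low confidence or user-flagged character.

    words: list of dicts with "text" and "chars", where "chars" is a list
    of (char_id, confidence) pairs. This layout is hypothetical.
    suspicious_ids: character ids flagged during the character session.
    """
    selected = []
    for word in words:
        low = any(conf < threshold for _, conf in word["chars"])
        flagged = any(cid in suspicious_ids for cid, _ in word["chars"])
        if low or flagged:
            selected.append(word)
    return selected


def spelling_options(ocr_text, lexicon, n=3):
    """Suggest close lexicon entries for the raw OCR result."""
    return difflib.get_close_matches(ocr_text, lexicon, n=n, cutoff=0.6)


if __name__ == "__main__":
    words = [{"text": "tbe", "chars": [("c1", 0.9), ("c2", 0.5), ("c3", 0.9)]},
             {"text": "cat", "chars": [("c4", 0.95), ("c5", 0.95), ("c6", 0.95)]}]
    for word in words_for_review(words, suspicious_ids=set()):
        print(word["text"], spelling_options(word["text"], ["the", "tube", "table"]))
```

The user would either confirm one of the suggested options or type a spelling of their own, as the slide describes.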
  14. Page session
       • Used primarily where a segmentation failure has led to a word being misrecognised or not recognised at all
       • Text can be shown in a variety of segmentation views: word, line, paragraph or tag
       • System can be automated to move from one problematic word to the next
  15. System demonstration
       • Screencast created by Gerd Zechmeister of the Austrian National Library
  16. Development plan
       • User/job monitoring
           • Introduction of planned errors to validate operator progress
       • Productivity and quality benchmarking
       • Correction and validation system for “document dictionaries”
           • Each corrected book produces a unique dictionary that can be reused on other works
           • Will include general language and name dictionaries being produced by IMPACT
       • Distributed workflow management tool and administrative session
       • Greater number of output formats for OCR results and corrected results
       • “Superkey” session
           • Character validation by creation of “ideal” character patterns
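
The "planned errors" idea under user/job monitoring can be illustrated with a small scoring function: deliberately planted errors with known answers are mixed into an operator's queue, and the share the operator corrects properly gives a quality estimate. This is a sketch of the general technique only; the function name and data layout are hypothetical and the development plan does not specify a scoring method.

```python
def operator_accuracy(seeded, responses):
    """Fraction of planted errors that an operator corrected properly.

    seeded: maps a hypothetical item id to the known correct text for an
            error that was planted on purpose.
    responses: maps item ids to the text the operator submitted.
    Returns None when no planted errors were issued.
    """
    if not seeded:
        return None
    caught = sum(1 for item_id, truth in seeded.items()
                 if responses.get(item_id) == truth)
    return caught / len(seeded)


if __name__ == "__main__":
    seeded = {"a": "the", "b": "house"}
    responses = {"a": "the", "b": "hou5e", "c": "garden"}
    print(operator_accuracy(seeded, responses))
```

Items the operator never reached count as missed here; a real system might choose to exclude them instead.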
  17. Alternatives to OCR and post-correction: Word Spotting
       • Alternative technique for indexing historical documents
       • After word segmentation, relevant words are detected and highlighted
       • Key words can be person and location names (e.g. taken from the Named Entities Registry)
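
The slide does not say how relevant words are detected. One simple way to picture word spotting is to compare a feature vector for each segmented word image against template vectors for the key words and highlight close matches. The sketch below assumes fixed-length feature vectors and Euclidean distance; both, along with all names and the `max_distance` threshold, are assumptions made for illustration.

```python
import math


def spot_words(word_features, keyword_templates, max_distance=0.5):
    """Return (word_id, keyword) pairs for word images close to a template.

    word_features: maps a segmented word image id to its feature vector.
    keyword_templates: maps a key word (e.g. a person or place name) to a
    template feature vector of the same length.
    """
    hits = []
    for word_id, vector in word_features.items():
        best = None
        for keyword, template in keyword_templates.items():
            distance = math.dist(vector, template)
            if distance <= max_distance and (best is None or distance < best[1]):
                best = (keyword, distance)
        if best is not None:
            hits.append((word_id, best[0]))
    return hits


if __name__ == "__main__":
    features = {"w1": [0.1, 0.9], "w2": [0.8, 0.2]}
    templates = {"London": [0.1, 1.0], "Anderson": [5.0, 5.0]}
    print(spot_words(features, templates))
```

Because matching works on image features rather than recognised text, this kind of indexing does not depend on the OCR result being correct.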