IMPACT Final Conference - Asaf Tzadok

CONCERT: COoperative eNgine for Correction of ExtRacted Text. Asaf Tzadok, Manager, Image and Document Analytics Group. October 2011

Introduction. At least an estimated 100 million books have been produced since Johann Gutenberg invented movable type in the 15th century. A large part of this vast literature is now being converted to digital books and moved into the world of electronic publishing. The digitization process involves scanning technologies, OCR (Optical Character Recognition), and post-correction. OCR quality ranges between 50% and 90% word-level accuracy, so post-correction is a must, but it is costly and slow: about 1 euro per A5 page.

Crowd Sourcing Projects. Distributed Proofreaders (Project Gutenberg); National Library of Australia (Australian Newspaper Digitisation); LDS Church (FamilySearch); The National Library of Finland (Digitalkoot). All are purely volunteer-based crowdsourcing programs. It works!

Project Gutenberg – 1st Gen.

NLA – Australian Newspapers – 2nd Gen.

Collaborative Correction – State of the Art (cont.). State-of-the-art systems, such as Project Gutenberg, simply show the page image and the OCR results to be corrected. Drawbacks: a slow and unproductive process; prone to errors; hard to cross-check or merge; two passes are needed to ensure quality. Result: a complex, hard-to-track process = a lot of manual labor = limited public participation and contribution.

DIGITALKOOT - Mole Games – 3rd Gen.

Collaborative Correction – Games. Advantages: wider and younger public participation; easy to cross-check; allows parallelism; fully scalable. Drawbacks: low productivity factor; static process with a huge amount of work; limited use of the feedback from the users; very long process to complete the digitization.

Collaborative Correction – How does it work? A fully web-based collaborative-correction system: avoids any installation on the client side; intuitive for wide public use. Call for participation (optional): via the official website of the library; collection based. Volunteers keen on contributing to the preservation of their cultural heritage: top-performers lists; library recognition awards; acknowledgements.

CONCERT. Adaptive collaborative correction platform: uses the feedback from the users to improve productivity; fully connected to the Adaptive OCR Engine. Strong emphasis on productivity tools: reduce the time for verification/correction; patented smart-key approach; motivate volunteers. Separating the data entry process into several complementary tasks: an optimized application dedicated to each task. Breaking down the tasks into subtasks: makes them suitable for parallel processing; online compilation. Digitization flow optimizations: hierarchical context levels, character -> word -> page.

CONCERT System Architecture (diagram). Labels shown: Image Enhancements; OMNI Engine (ABBYY FRE); Book Fonts Extraction; Book Optimized Adaptive OCR Engine; CONCERT Quality Control; Dictionaries; Scanned Book; High Quality Transcription; Web Users; CONCERT Productivity Tools; CONCERT Games.

Adaptive OCR. A major key player in the CONCERT system. Key technologies: Adaptive OCR for books; OCR of books by word recognition; Adaptive Optical Character Recognition on a Document with Distorted Characters. Papers: Word-Based Adaptive OCR for Historical Books, ICDAR 2009; co-operation with Apostolos Antonacopoulos from Salford University; Hybrid Approach to Adaptive OCR, ICDAR 2011.

Adaptive OCR - Requirements. Consistent and reliable confidence level: important for quality assurance. No use of prior knowledge of the font: crazy fonts can be handled. Good use of the feedback from the users: at character and word level. Robust to distortion: page-level distortion and printing variations. Easy to migrate between books from the same publisher: continuous update. Not too slow: around 2-3 times slower than OMNI engines.

Adaptive OCR – Technical Considerations. Pixel domain (template matching): pros are ease of implementation and scoring consistency; cons are that it is slow and sensitive to small distortion. Feature domain: pros are that it is fast and robust to small distortion, and using invariant features can improve robustness to distortion; the con is an inconsistent scoring mechanism.

Adaptive OCR - Hybrid Approach

Distortion Example. Using hierarchical optical flow: high-quality compensation for non-linear character warping; can overcome significant distortions.

Adaptive OCR – Figure of Merit. A quality measure of the OCR that takes the following into account: reject rate (how many characters should be verified by the user) and substitution rate (errors in characters that were not rejected). In the following formula, each substitution counts as 5 rejected characters in terms of workload: FoM = (NoR + 5 * NoF) / NoW
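
The Figure of Merit formula can be sketched in a few lines of Python. This is only an illustration: the function and parameter names are hypothetical, NoR is read as the number of rejected characters and NoF as the number of substitutions, and since the slide does not define NoW, it is treated here simply as the normalising total.

```python
def figure_of_merit(num_rejected: int, num_substitutions: int, total: int) -> float:
    """Compute FoM = (NoR + 5 * NoF) / NoW.

    num_rejected      -- NoR, characters sent to the user for verification
    num_substitutions -- NoF, errors left in characters that were not rejected
    total             -- NoW, the normalising count (not defined on the slide)
    """
    if total <= 0:
        raise ValueError("total must be positive")
    # Each substitution is weighted as 5 rejected characters of workload.
    return (num_rejected + 5 * num_substitutions) / total


if __name__ == "__main__":
    # 100 rejects and 10 substitutions over a total of 1000.
    print(figure_of_merit(100, 10, 1000))
```

Under this reading, a lower value means less manual work per unit of text, because both rejects and undetected errors increase the numerator.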

System flow. Character (Carpet) Session: fast validation of OCR results; every word with a rejected character is routed to the Word Session. Word Verifier Session: used for cases where contextual information is necessary; a rejected word will be corrected in the Page Session. Page-level Session: for final closure of the page; used when a view of the entire page is required for understanding.
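
The escalation from character to word to page can be pictured as a tiny state machine. The sketch below is an assumption-laden illustration, not CONCERT code: the `Session` enum and `next_session` function are hypothetical names, and treating the Page Session as always terminal is a simplification of "final closure of the page".

```python
from enum import Enum


class Session(Enum):
    CHARACTER = "character"
    WORD = "word"
    PAGE = "page"
    DONE = "done"


def next_session(current: Session, rejected: bool) -> Session:
    """Return where an item goes after being handled in `current`."""
    if not rejected:
        return Session.DONE       # validated, nothing more to do
    if current is Session.CHARACTER:
        return Session.WORD       # rejected character -> Word Session
    if current is Session.WORD:
        return Session.PAGE       # rejected word -> Page Session
    return Session.DONE           # Page Session closes the page (assumption)


if __name__ == "__main__":
    step = Session.CHARACTER
    while step is not Session.DONE:
        print(step.value)
        step = next_session(step, rejected=True)
```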

Character Session. OCR results are analyzed: very high confidence results don’t require verification; high confidence results may include some character recognition errors, hence the Character Session is used; low confidence results may have been caused by segmentation errors, hence the Word Session is used. For the Character Session, individual character images are extracted and grouped together based on the recognition results (i.e. all the “b” characters are grouped together in the same session). Within a given session, all the characters are grouped based on their confidence.
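
A minimal sketch of this triage and grouping, under stated assumptions: the thresholds `VERY_HIGH` and `HIGH`, the `CharResult` type, and the function names are all hypothetical, confidences are assumed to lie in [0, 1], and "grouped based on their confidence" is approximated by sorting each session by confidence.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class CharResult(NamedTuple):
    label: str         # recognised character, e.g. "b"
    confidence: float  # assumed to be in [0, 1]


VERY_HIGH = 0.98  # hypothetical threshold: accept without verification
HIGH = 0.80       # hypothetical threshold: verify in a Character Session


def triage(result: CharResult) -> str:
    """Decide which path a recognised character takes."""
    if result.confidence >= VERY_HIGH:
        return "accept"
    if result.confidence >= HIGH:
        return "character_session"
    return "word_session"  # possible segmentation error, needs word context


def build_character_sessions(results: Iterable[CharResult]) -> Dict[str, List[CharResult]]:
    """Group characters needing verification by label, ordered by confidence."""
    sessions: Dict[str, List[CharResult]] = defaultdict(list)
    for result in results:
        if triage(result) == "character_session":
            sessions[result.label].append(result)
    for items in sessions.values():
        items.sort(key=lambda r: r.confidence)  # least certain first
    return dict(sessions)
```

Showing all instances of one label together is what makes the carpet view fast to scan: an outlier stands out among otherwise identical glyphs.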

Word Session. Used for words that: are not in the dictionary; have low-confidence characters; have characters rejected in the Character Session. Shows: the original word image; the recognition results; possible spelling options. Words are ordered alphabetically, i.e. lexicographically by their recognition results.
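
The three selection criteria and the alphabetical ordering can be sketched as follows. Everything named here (`WordResult`, `LOW_CONFIDENCE`, the function names) is hypothetical, and the case-insensitive dictionary lookup is an assumption rather than something the slides specify.

```python
from typing import Iterable, List, NamedTuple, Sequence, Set


class WordResult(NamedTuple):
    text: str                          # recognition result for the word
    char_confidences: Sequence[float]  # assumed to be in [0, 1]
    has_rejected_char: bool            # rejected earlier in a Character Session


LOW_CONFIDENCE = 0.80  # hypothetical threshold


def needs_word_session(word: WordResult, dictionary: Set[str]) -> bool:
    """True if any of the three criteria for the Word Session holds."""
    return (
        word.text.lower() not in dictionary
        or any(c < LOW_CONFIDENCE for c in word.char_confidences)
        or word.has_rejected_char
    )


def word_session_queue(words: Iterable[WordResult], dictionary: Set[str]) -> List[WordResult]:
    """Select words for verification, ordered by their recognition result."""
    selected = [w for w in words if needs_word_session(w, dictionary)]
    return sorted(selected, key=lambda w: w.text)
```

Ordering by recognition result places identical and near-identical words next to each other, which lets a volunteer handle them in one pass.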

Word Session – Before data entry

Word Session – After data entry

Page Session. Used for correction of cases where word segmentation fails. Can be activated in one of 4 flavors: word view; line view; paragraph view; tagging view. The system can move automatically from one problematic word to the next.

Multilingual Support - English 1772

Multilingual Support - French 1668

Multilingual Support - German Gothic 1778

Multilingual Support - Dutch 1789

Multilingual Support - Japanese

Hearst Newsreel Collection – Index Card

User Monitoring. Wide public participation may end up with data corruption by malicious users or unqualified users. User rating and feedback motivate the use of the system. Three-way validation: good injection (characters/words known with high confidence to be true); similar injection (characters/words which may look similar but are not identical, for example an ‘O’ injected into a ‘Q’ session); error injection (characters/words known with high confidence to be false).
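
One way to turn injected items into a user rating is to compare each answer with the expected one. This is a sketch under assumptions: the slides do not say how answers are scored, so the accept/reject model, the `InjectedAnswer` type, and the `min_score` cut-off are all hypothetical.

```python
from typing import Iterable, NamedTuple, Optional


class InjectedAnswer(NamedTuple):
    kind: str            # "good", "similar" or "error"
    user_accepted: bool  # did the user confirm the item as correct?


# Expected behaviour per injection type (assumption): only "good" items
# should be accepted; look-alikes and known errors should be rejected.
EXPECTED_ACCEPT = {"good": True, "similar": False, "error": False}


def user_score(answers: Iterable[InjectedAnswer]) -> Optional[float]:
    """Fraction of injected items the user handled as expected."""
    answers = list(answers)
    if not answers:
        return None  # no evidence about this user yet
    correct = sum(a.user_accepted == EXPECTED_ACCEPT[a.kind] for a in answers)
    return correct / len(answers)


def is_trusted(answers: Iterable[InjectedAnswer], min_score: float = 0.9) -> bool:
    """Hypothetical policy: trust users scoring at least `min_score`."""
    score = user_score(answers)
    return score is not None and score >= min_score
```

A score of this kind could feed both quality control (discarding input from unreliable users) and the user rating mentioned above.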

User Monitoring – Screenshots

User Monitoring – Screenshots Cont.

CONCERT in use. Hearst Newsreel Archive Collection: first production use; tagging capabilities. Pilot in Japan for the Japanese Library: including customization for Japanese. 1st-phase pilots in major libraries in Europe: KB (National Library of the Netherlands); BL (British Library); BSB (Bavarian State Library).

CONCERT Future Planning. Search over OCR: beyond transcription. Improve user feedback: online advisor; best-performers list. Community building around content: integrate community tools within the platform. CONCERT Games: iPhone/iPad/Android/Desktop. E-book creation: fully digital transcription; using the original font as an option. Page distortion correction: fully integrate the word-based page distortion correction.
