IMPACT Final Conference - Asaf Tzadok


IBM Adaptive OCR engine and CONCERT (Cooperative Correction), including the library perspective

Published in: Technology, Business
  • Page session – see the entire page – System breaks out words, lines, letters (what you want) – color coding shows initial confidence – blue is high, red is low, green is corrected – move from
  • IMPACT Final Conference - Asaf Tzadok

    1. 1. CONCERT: COoperative eNgine for Correction of ExtRacted Text. Asaf Tzadok, Manager, Image and Document Analytics Group. October 2011
    2. 2. Introduction <ul><li>An estimated 100 million books or more have been produced since Johann Gutenberg invented movable type in the 15th century. </li></ul><ul><li>A large part of this vast literature is now being converted to digital books and moved into the world of electronic publishing. </li></ul><ul><li>The digitization process involves </li></ul><ul><ul><li>Scanning technologies </li></ul></ul><ul><ul><li>OCR (Optical Character Recognition) </li></ul></ul><ul><ul><li>Post-correction </li></ul></ul><ul><li>OCR quality ranges between 50% and 90% word-level accuracy </li></ul><ul><li>Post-correction is a must, but it is costly and time-consuming </li></ul><ul><ul><li>~1 euro per A5 page </li></ul></ul>
    3. 3. Crowd Sourcing Projects <ul><li>Distributed Proofreaders </li></ul><ul><ul><li>Gutenberg Project </li></ul></ul><ul><li>National Library of Australia </li></ul><ul><ul><li>Australian Newspaper Digitisation </li></ul></ul><ul><li>LDS Church </li></ul><ul><ul><li>Family Search </li></ul></ul><ul><li>The National Library of Finland </li></ul><ul><ul><li>Digitalkoot </li></ul></ul><ul><li>All are purely volunteer-based crowdsourcing programs </li></ul><ul><ul><li>It works! </li></ul></ul>
    4. 4. Gutenberg Project – 1st Gen.
    5. 5. NLA – Australian Newspapers – 2nd Gen.
    6. 6. Collaborative Correction – State of the Art cont. <ul><li>State-of-the-art systems, such as Project Gutenberg, </li></ul><ul><ul><li>Simply show page image and OCR results to be corrected </li></ul></ul><ul><li>Drawbacks: </li></ul><ul><ul><li>Slow and unproductive process </li></ul></ul><ul><ul><li>Prone to errors </li></ul></ul><ul><ul><li>Hard to cross-check/merge </li></ul></ul><ul><ul><li>Two passes are needed to ensure quality </li></ul></ul><ul><li>Result: Complex, hard to track process = a lot of manual labor = limited public participation and contribution </li></ul>
    7. 7. DIGITALKOOT - Mole Games – 3rd Gen.
    8. 8. Collaborative Correction – Games <ul><li>Wider and younger public participation </li></ul><ul><li>Easy to cross-check </li></ul><ul><li>Allows parallelism </li></ul><ul><li>Fully scalable </li></ul><ul><li>Drawbacks </li></ul><ul><ul><li>Low productivity factor </li></ul></ul><ul><ul><li>Static process with a huge amount of work </li></ul></ul><ul><ul><li>Limited use of the feedback from the users </li></ul></ul><ul><ul><li>Very long process to complete the digitization </li></ul></ul>
    9. 9. Collaborative Correction – How does it work <ul><li>A fully web-based collaborative correction system </li></ul><ul><ul><li>Avoids any installation on the client side </li></ul></ul><ul><ul><li>Intuitive for use by the wider public </li></ul></ul><ul><li>Call for participation (optional) </li></ul><ul><ul><li>Via the official website of the library </li></ul></ul><ul><ul><li>Collection based </li></ul></ul><ul><li>Volunteers keen on contributing to the preservation of their cultural heritage </li></ul><ul><ul><li>Top performers lists </li></ul></ul><ul><ul><li>Library recognition awards </li></ul></ul><ul><ul><li>Acknowledgements </li></ul></ul>
    10. 10. CONCERT <ul><li>Adaptive collaborative correction platform </li></ul><ul><ul><li>Uses the feedback from the users to improve productivity </li></ul></ul><ul><ul><li>Fully connected to the Adaptive OCR Engine </li></ul></ul><ul><li>Strong emphasis on productivity tools </li></ul><ul><ul><li>Reduce the time for verification/correction </li></ul></ul><ul><ul><ul><li>Patented smart-key approach </li></ul></ul></ul><ul><ul><li>Motivate volunteers </li></ul></ul><ul><li>Separating the data entry process into several complementary tasks </li></ul><ul><ul><li>Optimized application dedicated to each task </li></ul></ul><ul><ul><li>Break down the tasks into subtasks </li></ul></ul><ul><ul><li>Make it suitable for parallel processing </li></ul></ul><ul><ul><li>Online compilation </li></ul></ul><ul><li>Digitization flow optimizations </li></ul><ul><ul><li>Hierarchical context level: character -> word -> page </li></ul></ul>
    11. 11. CONCERT System Architecture [diagram; component labels: Scanned Book; Image Enhancements; OMNI Engine (ABBYY FRE); Book Fonts Extraction; Book Optimized Adaptive OCR Engine; Dictionaries; CONCERT Quality Control; CONCERT Productivity Tools; CONCERT Games; Web Users; High Quality Transcription]
    12. 12. Adaptive OCR <ul><li>A key component of the CONCERT System </li></ul><ul><li>Key Technologies </li></ul><ul><ul><li>Adaptive OCR for books </li></ul></ul><ul><ul><li>OCR of books by word recognition </li></ul></ul><ul><ul><li>Adaptive Optical Character Recognition on a Document with Distorted Characters </li></ul></ul><ul><li>Papers </li></ul><ul><ul><li>Word-Based Adaptive OCR for Historical Books, ICDAR 2009 </li></ul></ul><ul><ul><ul><li>Co-operation with Apostolos Antonacopoulos from Salford University </li></ul></ul></ul><ul><ul><li>Hybrid Approach to Adaptive OCR, ICDAR 2011 </li></ul></ul>
    13. 13. Adaptive OCR - Requirements <ul><li>Consistent and reliable confidence level </li></ul><ul><ul><li>Important for quality assurance </li></ul></ul><ul><li>No use of prior knowledge of the font </li></ul><ul><ul><li>Even highly unusual fonts can be handled </li></ul></ul><ul><li>Good use of the feedback from the users </li></ul><ul><ul><li>Character and word level </li></ul></ul><ul><li>Robust to distortion </li></ul><ul><ul><li>Page-level distortion and printing variations </li></ul></ul><ul><li>Easy to migrate between books from the same publisher </li></ul><ul><ul><li>Continuous update </li></ul></ul><ul><li>Not too slow </li></ul><ul><ul><li>Around 2-3 times slower than OMNI Engines </li></ul></ul>
    14. 14. Adaptive OCR – Technical Considerations <ul><li>Pixel Domain (Template matching) </li></ul><ul><ul><li>Pros </li></ul></ul><ul><ul><ul><li>Easy to implement </li></ul></ul></ul><ul><ul><ul><li>Scoring consistency </li></ul></ul></ul><ul><ul><li>Cons </li></ul></ul><ul><ul><ul><li>Slow </li></ul></ul></ul><ul><ul><ul><li>Sensitive to small distortions </li></ul></ul></ul><ul><li>Features Domain </li></ul><ul><ul><li>Pros </li></ul></ul><ul><ul><ul><li>Fast </li></ul></ul></ul><ul><ul><ul><li>Robust to small distortions </li></ul></ul></ul><ul><ul><ul><li>Using invariant features can improve robustness to distortion </li></ul></ul></ul><ul><ul><li>Cons </li></ul></ul><ul><ul><ul><li>Inconsistent scoring mechanism </li></ul></ul></ul>
    15. 15. Adaptive OCR - Hybrid Approach
    16. 16. Distortion Example <ul><li>Using hierarchical optic-flow </li></ul><ul><li>High quality results for compensation for non-linear character warping </li></ul><ul><li>Can overcome significant distortions </li></ul>
    17. 17. Adaptive OCR – Figure of Merit <ul><li>Quality measure of the OCR, taking into account the following </li></ul><ul><ul><li>reject rate – how many characters should be verified by the user </li></ul></ul><ul><ul><li>substitution rate – errors among the characters that were not rejected </li></ul></ul><ul><li>We use the following formula, in which each substitution counts as 5 rejected characters in terms of workload </li></ul><ul><li>FoM = (NoR + 5 * NoF) / NoW </li></ul>
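The figure of merit on slide 17 can be evaluated with a few lines of Python. This is only a sketch: the slide does not define its symbols, so reading NoR as the number of rejected characters, NoF as the number of substitutions (false results), and NoW as the total count used for normalization is an assumption, as are the function name and the example counts.

```python
def figure_of_merit(num_rejects: int, num_substitutions: int, total: int,
                    substitution_weight: float = 5.0) -> float:
    """Compute FoM = (NoR + 5 * NoF) / NoW.

    Assumed meanings (not stated on the slide): num_rejects is NoR,
    num_substitutions is NoF, total is NoW.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return (num_rejects + substitution_weight * num_substitutions) / total


# Hypothetical counts, for illustration only.
example = figure_of_merit(num_rejects=40, num_substitutions=4, total=1000)
```

Because a substitution is weighted five times as heavily as a reject, an engine that rejects uncertain characters rather than guessing will tend to score a lower (presumably better) value under this measure.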
    18. 18. System flow <ul><li>Character (Carpet) session </li></ul><ul><ul><li>Fast validation of OCR results </li></ul></ul><ul><ul><li>Every word with a rejected character is routed to the Word Session </li></ul></ul><ul><li>Word Verifier Session </li></ul><ul><ul><li>Used for cases where contextual information is necessary </li></ul></ul><ul><ul><li>A rejected word will be corrected in the Page Session </li></ul></ul><ul><li>Page-level Session </li></ul><ul><ul><li>For final closure of the page </li></ul></ul><ul><ul><li>Used when a view of the entire page is required for understanding </li></ul></ul>
    19. 19. Character Session <ul><li>OCR results are analyzed: </li></ul><ul><ul><li>Very high confidence results don’t require verification </li></ul></ul><ul><ul><li>High confidence results may include some character recognition errors. Hence, the Character session is used </li></ul></ul><ul><ul><li>Low confidence results may have been caused by segmentation errors. Hence, the Word session is used. </li></ul></ul><ul><li>For the Character session, individual character images are extracted and grouped together based on the recognition results (i.e. all the “b” characters are grouped together in the same session) </li></ul><ul><li>Within a given session, the characters are grouped based on their confidence </li></ul>
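The triage described on slide 19 can be sketched as follows. This is an illustrative outline, not the CONCERT implementation: the threshold values, the `CharResult` type, the function names, and the choice to sort each session from lowest to highest confidence are all assumptions, since the slides give no numeric thresholds or ordering.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CharResult:
    label: str         # character the OCR engine recognised
    confidence: float  # assumed to lie in [0, 1]
    word_id: int       # index of the word the character belongs to


# Hypothetical thresholds; the slides do not give numeric values.
VERY_HIGH = 0.98
HIGH = 0.80


def route(result: CharResult) -> str:
    """Decide where a recognised character goes next."""
    if result.confidence >= VERY_HIGH:
        return "accept"             # no verification needed
    if result.confidence >= HIGH:
        return "character_session"  # quick visual check by a volunteer
    return "word_session"           # possible segmentation error


def build_character_sessions(results):
    """Group character-session items by recognised label (all "b" together)."""
    sessions = defaultdict(list)
    for r in results:
        if route(r) == "character_session":
            sessions[r.label].append(r)
    for items in sessions.values():
        # Assumed ordering: least confident characters first.
        items.sort(key=lambda r: r.confidence)
    return dict(sessions)


results = [
    CharResult("b", 0.99, 0),
    CharResult("b", 0.91, 1),
    CharResult("b", 0.85, 2),
    CharResult("h", 0.60, 3),
]
sessions = build_character_sessions(results)
```

Grouping by recognised label lets a volunteer scan a grid of supposedly identical characters and pick out the odd ones, which is much faster than reading text in context.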
    20. 20. Character Session
    21. 21. Character Session
    22. 22. Character Session
    23. 23. Word Session <ul><li>Used for words that </li></ul><ul><ul><li>Are not in the dictionary </li></ul></ul><ul><ul><li>Have low-confidence characters </li></ul></ul><ul><ul><li>Have characters rejected in the Character Session </li></ul></ul><ul><li>Shows </li></ul><ul><ul><li>Original word image </li></ul></ul><ul><ul><li>Recognition results </li></ul></ul><ul><ul><li>Possible spelling options </li></ul></ul><ul><li>Words are ordered alphabetically </li></ul><ul><ul><li>Lexicographic order based on the recognition results </li></ul></ul>
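A minimal sketch of how a Word Session queue might be assembled from the three criteria on slide 23. Everything named here is hypothetical: the `WordResult` type, the threshold, and the use of Python's standard `difflib.get_close_matches` as a stand-in for whatever produces the "possible spelling options" in CONCERT.

```python
import difflib
from dataclasses import dataclass


@dataclass
class WordResult:
    text: str                   # OCR recognition result for the word
    min_char_confidence: float  # lowest character confidence in the word
    rejected_in_char_session: bool = False


LOW_CONFIDENCE = 0.80  # hypothetical threshold


def needs_word_session(word: WordResult, dictionary: set) -> bool:
    """Apply the three criteria listed on the slide."""
    return (
        word.text.lower() not in dictionary
        or word.min_char_confidence < LOW_CONFIDENCE
        or word.rejected_in_char_session
    )


def build_word_session(words, dictionary):
    """Return (recognised text, spelling options) pairs in lexicographic order."""
    queue = [w for w in words if needs_word_session(w, dictionary)]
    queue.sort(key=lambda w: w.text)  # ordered by recognition result
    candidates = sorted(dictionary)
    return [
        (w.text, difflib.get_close_matches(w.text.lower(), candidates, n=3))
        for w in queue
    ]


dictionary = {"the", "quick", "brown", "fox"}
words = [
    WordResult("the", 0.99),
    WordResult("qu1ck", 0.95),                                # not in dictionary
    WordResult("brown", 0.55),                                # low confidence
    WordResult("fox", 0.97, rejected_in_char_session=True),   # rejected earlier
]
session = build_word_session(words, dictionary)
```

Sorting by recognised text places similar OCR outputs next to each other, so a volunteer sees recurring misreadings of the same word in a row.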
    24. 24. Word Session – Before data entry
    25. 25. Word Session – After data entry
    26. 26. Word Session – Before data entry
    27. 27. Word Session – After data entry
    28. 28. Page Session <ul><li>Used for correction of cases where word segmentation fails </li></ul><ul><li>Can be activated in one of 4 flavors </li></ul><ul><ul><li>Word view </li></ul></ul><ul><ul><li>Line view </li></ul></ul><ul><ul><li>Paragraph view </li></ul></ul><ul><ul><li>Tagging view </li></ul></ul><ul><li>The system can move automatically from one problematic word to the next </li></ul>
    29. 29. CONCERT - Page Session
    30. 30. Multilingual Support - English 1772
    31. 31. Multilingual Support - French 1668
    32. 32. Multilingual Support - German Gothic 1778
    33. 33. Multilingual Support - Dutch 1789
    34. 34. Multilingual Support - Japanese
    35. 35. Hearst Newsreel Collection – Index Card
    36. 36. User Monitoring <ul><li>Wide public participation may end up with data corruption by </li></ul><ul><ul><li>Malicious users </li></ul></ul><ul><ul><li>Unqualified users </li></ul></ul><ul><li>User rating and feedback motivate the use of the system </li></ul><ul><li>Three-way validation </li></ul><ul><ul><li>Good injection </li></ul></ul><ul><ul><ul><li>Characters/Words known with high confidence to be true </li></ul></ul></ul><ul><ul><li>Similar injection </li></ul></ul><ul><ul><ul><li>Characters/Words which may look similar but are not identical </li></ul></ul></ul><ul><ul><ul><li>For example: ‘O’ injection in a ‘Q’ session </li></ul></ul></ul><ul><ul><li>Error injection </li></ul></ul><ul><ul><ul><li>Characters/Words known with high confidence to be false </li></ul></ul></ul>
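One way to turn the three injection types on slide 36 into a per-user score is sketched below. The slides do not say how CONCERT rates users, so the scoring rule (fraction of injected control items answered as expected), the `InjectedItem` type, and the function name are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InjectedItem:
    item_id: str
    kind: str  # "good", "similar", or "error"


# Assumed expected responses: known-correct items should be accepted;
# look-alike items (e.g. an 'O' in a 'Q' session) and known-wrong items
# should be rejected.
EXPECTED_ACCEPT = {"good": True, "similar": False, "error": False}


def user_reliability(injected, answers):
    """Fraction of injected control items the user answered as expected.

    answers maps item_id -> True if the user accepted the item.
    Returns None if the user has not yet seen any injected item.
    """
    checked = [item for item in injected if item.item_id in answers]
    if not checked:
        return None
    hits = sum(answers[item.item_id] == EXPECTED_ACCEPT[item.kind]
               for item in checked)
    return hits / len(checked)


# Hypothetical session: the user wrongly accepts the look-alike item "b".
injected = [
    InjectedItem("a", "good"),
    InjectedItem("b", "similar"),
    InjectedItem("c", "error"),
    InjectedItem("d", "good"),
]
answers = {"a": True, "b": True, "c": False, "d": True}
score = user_reliability(injected, answers)
```

A score like this could feed both sides of the slide: filtering out contributions from malicious or unqualified users, and giving volunteers a rating as feedback.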
    37. 37. User Monitoring – Screenshots
    38. 38. User Monitoring – Screenshots Cont.
    39. 39. User Monitoring – Screenshots Cont.
    40. 40. User Monitoring – Screenshots Cont.
    41. 41. User Monitoring – Screenshots Cont.
    42. 42. User Monitoring – Screenshots Cont.
    43. 43. User Monitoring – Screenshots Cont.
    44. 44. CONCERT Games
    45. 45. CONCERT in use <ul><li>Hearst Newsreel Archive Collection </li></ul><ul><ul><li>First production use </li></ul></ul><ul><ul><li>Tagging capabilities </li></ul></ul><ul><li>Pilot in Japan for the Japanese Library </li></ul><ul><ul><li>Including customization for Japanese </li></ul></ul><ul><li>1st phase pilots in major libraries in Europe </li></ul><ul><ul><li>KB – National Library of the Netherlands </li></ul></ul><ul><ul><li>BL – British Library </li></ul></ul><ul><ul><li>BSB – Bavarian State Library </li></ul></ul>
    46. 46. CONCERT Future Planning <ul><li>Search Over OCR </li></ul><ul><ul><li>Beyond transcription </li></ul></ul><ul><li>Improve User Feedback </li></ul><ul><ul><li>Online advisor </li></ul></ul><ul><ul><li>Best performers list </li></ul></ul><ul><li>Community building around content </li></ul><ul><ul><li>Integrate community tools within the platform </li></ul></ul><ul><li>CONCERT Games </li></ul><ul><ul><li>iPhone/iPad/Android/Desktop </li></ul></ul><ul><li>E-Book creation </li></ul><ul><ul><li>Fully digital transcription </li></ul></ul><ul><ul><li>Using the original font as an option </li></ul></ul><ul><li>Page distortion correction </li></ul><ul><ul><li>Fully integrate the word-based page distortion correction </li></ul></ul>
    47. 47. Thank You!