CONCERT: COoperative eNgine for Correction of ExtRacted Text. Asaf Tzadok, Manager, Image and Document Analytics Group, October 2011
Introduction: An estimated 100 million or more books have been produced since Johann Gutenberg invented movable type in the 15th century. A large part of this vast literature is now being converted to digital books and moved into the world of electronic publishing. The digitization process involves scanning technologies, OCR (Optical Character Recognition), and post-correction. OCR quality ranges between 50% and 90% word-level accuracy. Post-correction is a must, and it is costly and time-consuming: about 1 euro per A5 page.
Crowd Sourcing Projects: Distributed Proofreaders (Gutenberg Project); National Library of Australia (Australian Newspaper Digitisation); LDS Church (Family Search); The National Library of Finland (Digitalkoot). All are purely volunteer-based crowdsourcing programs. It works!
Gutenberg Project – 1st Gen.
NLA – Australian Newspapers – 2nd Gen.
Collaborative Correction – State of the Art (cont.): State-of-the-art systems, such as Project Gutenberg, simply show the page image and the OCR results to be corrected. Drawbacks: slow and unproductive process; prone to errors; hard to cross-check and merge; two passes are needed to ensure quality. Result: a complex, hard-to-track process = a lot of manual labor = limited public participation and contribution.
DIGITALKOOT - Mole Games – 3rd Gen.
Collaborative Correction – Games: Wider and younger public participation; easy to cross-check; allows parallelism; fully scalable. Drawbacks: low productivity factor; static process with a huge amount of work; limited use of the feedback from the users; very long process to complete the digitization.
Collaborative Correction – How does it work: A fully web-based collaborative-correction system: avoids any installation on the client side; intuitive for use by the wide public. Call for participation (optional): via the official website of the library; collection based. Volunteers keen on contributing to the preservation of their cultural heritage: top-performers lists; library recognition awards; acknowledgements.
CONCERT: An adaptive collaborative correction platform: uses the feedback from the users to improve productivity; fully connected to the Adaptive OCR Engine. Strong emphasis on productivity tools: reduce the time for verification/correction; patented smart-key approach; motivate volunteers. Separating the data entry process into several complementary tasks: an optimized application dedicated to each task; break down the tasks into subtasks; make it suitable for parallel processing; online compilation. Digitization flow optimizations: hierarchical context level, character -> word -> page.
CONCERT System Architecture (diagram; labelled components: Image Enhancements, OMNI Engine (ABBYY FRE), Book Fonts Extraction, Book Optimized Adaptive OCR Engine, CONCERT Quality Control, Dictionaries, Scanned Book, High Quality Transcription, Web Users, CONCERT Productivity Tools, CONCERT Games)
Adaptive OCR: A major key player in the CONCERT system. Key technologies: Adaptive OCR for books; OCR of books by word recognition; Adaptive Optical Character Recognition on a Document with Distorted Characters. Papers: Word-Based Adaptive OCR for Historical Books, ICDAR 2009; co-operation with Apostolos Antonacopoulos from Salford University; Hybrid Approach to Adaptive OCR, ICDAR 2011.
Adaptive OCR - Requirements: Consistent and reliable confidence level (important for quality assurance). No use of prior knowledge of the font (unusual fonts can be handled). Good use of the feedback from the users (character and word level). Robust to distortion (page-level distortion and printing variations). Easy to migrate between books from the same publisher (continuous update). Not too slow (around 2-3 times slower than OMNI engines).
Adaptive OCR – Technical Considerations: Pixel domain (template matching). Pros: easy to implement; scoring consistency. Cons: slow; sensitive to small distortion. Feature domain. Pros: fast; robust to small distortion; using invariant features can improve robustness to distortion. Cons: inconsistent scoring mechanism.
Adaptive OCR - Hybrid Approach
Distortion Example: Using hierarchical optical flow. High-quality compensation for non-linear character warping. Can overcome significant distortions.
Adaptive OCR – Figure of Merit: A quality measure of the OCR, taking into account the following: reject rate (how many characters should be verified by the user) and substitution rate (errors in characters that were not rejected). We use the following formula, in which each substitution is equal to 5 rejected characters in terms of workload: FoM = (NoR + 5 * NoF) / NoW
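The figure of merit can be computed with a short helper. This is only a sketch: the slide does not spell out the abbreviations, so reading NoR as the number of rejects, NoF as the number of substitutions (false accepts), and NoW as the total count used for normalization is an assumption, as are the function and parameter names. Under that reading, a lower value means less manual work.

```python
def figure_of_merit(num_rejects: int, num_substitutions: int, total: int,
                    substitution_weight: float = 5.0) -> float:
    """Compute FoM = (NoR + 5 * NoF) / NoW.

    Assumed reading of the slide's symbols (not defined in the source):
      num_rejects       -> NoR, items the user must verify
      num_substitutions -> NoF, errors among the items that were not rejected
      total             -> NoW, the count used for normalization
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if num_rejects < 0 or num_substitutions < 0:
        raise ValueError("counts must be non-negative")
    # Each substitution is weighted as 5 rejects, as stated on the slide.
    return (num_rejects + substitution_weight * num_substitutions) / total


if __name__ == "__main__":
    # Made-up counts for illustration only: 40 rejects, 4 substitutions, 1000 items.
    print(figure_of_merit(40, 4, 1000))
```

With the weight of 5, trading one substitution for up to five extra rejects leaves the score no worse, which is why a reliable confidence level matters for this measure.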
System flow: Character (Carpet) Session: fast validation of OCR results; every word with a rejected character is routed to the Word Session. Word Verifier Session: used for cases where contextual information is necessary; a rejected word will be corrected in the Page Session. Page-level Session: for final closure of the page; used when a view of the entire page is required for understanding.
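The escalation between the three sessions can be sketched as a small state machine. This is an illustration of the flow described on the slide, not the CONCERT implementation; the `Session` enum, `next_session`, and `trace` are hypothetical names, and treating the Page Session as always closing the item is an assumption drawn from "final closure of the page".

```python
from enum import Enum
from typing import List


class Session(Enum):
    CHARACTER = "character"
    WORD = "word"
    PAGE = "page"
    DONE = "done"


def next_session(current: Session, rejected: bool) -> Session:
    """Return where an item goes after being handled in `current`."""
    if current is Session.DONE:
        raise ValueError("item is already closed")
    # Assumption: the page-level session always closes the item.
    if current is Session.PAGE or not rejected:
        return Session.DONE
    if current is Session.CHARACTER:
        return Session.WORD   # rejected character -> word session
    return Session.PAGE       # rejected word -> page session


def trace(rejected_at_character: bool, rejected_at_word: bool) -> List[Session]:
    """List the sessions an item passes through, ending with DONE."""
    flags = {
        Session.CHARACTER: rejected_at_character,
        Session.WORD: rejected_at_word,
        Session.PAGE: False,
    }
    path = [Session.CHARACTER]
    while path[-1] is not Session.DONE:
        path.append(next_session(path[-1], flags[path[-1]]))
    return path


if __name__ == "__main__":
    print([s.value for s in trace(True, True)])
```

Most items should leave the flow after the cheapest step, and only the hard cases reach the page view.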
Character Session: OCR results are analyzed. Very-high-confidence results don’t require verification. High-confidence results may include some character recognition errors; hence, the Character Session is used. Low-confidence results may have been caused by segmentation errors; hence, the Word Session is used. For the Character Session, individual character images are extracted and grouped together based on the recognition results (e.g. all the “b” characters would be grouped together in the same session). Within a given session, all the characters are grouped based on their confidence.
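The triage and grouping described above might look like the following sketch. The two thresholds are hypothetical values chosen for illustration (the slides give no numbers), the `Glyph` record is an invented data shape, and ordering each group by confidence is just one reading of "grouped based on their confidence".

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Glyph(NamedTuple):
    label: str         # character recognized by the OCR engine
    confidence: float  # OCR confidence in [0, 1]
    image_id: str      # reference to the extracted character image


# Hypothetical thresholds; the presentation does not state actual values.
VERY_HIGH_CONFIDENCE = 0.98
HIGH_CONFIDENCE = 0.80


def triage(glyph: Glyph) -> str:
    """Decide how a recognized character is handled."""
    if glyph.confidence >= VERY_HIGH_CONFIDENCE:
        return "accept"             # no verification needed
    if glyph.confidence >= HIGH_CONFIDENCE:
        return "character_session"  # quick visual validation
    return "word_session"           # possible segmentation error


def build_character_sessions(glyphs: Iterable[Glyph]) -> Dict[str, List[Glyph]]:
    """Group character-session glyphs by label, lowest confidence first."""
    sessions: Dict[str, List[Glyph]] = defaultdict(list)
    for glyph in glyphs:
        if triage(glyph) == "character_session":
            sessions[glyph.label].append(glyph)
    for group in sessions.values():
        group.sort(key=lambda g: g.confidence)
    return dict(sessions)


if __name__ == "__main__":
    sample = [
        Glyph("b", 0.99, "img-1"),
        Glyph("b", 0.91, "img-2"),
        Glyph("b", 0.85, "img-3"),
        Glyph("h", 0.40, "img-4"),
    ]
    for label, group in build_character_sessions(sample).items():
        print(label, [g.image_id for g in group])
```

Grouping identical labels on one screen lets a volunteer spot the odd one out at a glance, which is what makes this session fast.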
Character Session
Character Session
Character Session
Word Session: Used for words that are not in the dictionary, have low-confidence characters, or have characters rejected in the Character Session. Shows: the original word image; the recognition results; possible spelling options. Words are ordered alphabetically (lexicographic order based on the recognition results).
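The selection and ordering rules for the Word Session can be sketched as follows. The `Word` record, the `LOW_CONFIDENCE` threshold, and the function names are assumptions made for illustration; only the three selection criteria and the lexicographic ordering come from the slide.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Word(NamedTuple):
    text: str                              # OCR recognition result
    char_confidences: Tuple[float, ...]    # per-character confidence
    rejected_in_character_session: bool    # any character rejected earlier


# Hypothetical threshold; the presentation does not state an actual value.
LOW_CONFIDENCE = 0.80


def needs_word_session(word: Word, dictionary: Set[str]) -> bool:
    """A word is queued if any of the three criteria on the slide holds."""
    not_in_dictionary = word.text.lower() not in dictionary
    has_low_confidence = min(word.char_confidences, default=1.0) < LOW_CONFIDENCE
    return (not_in_dictionary
            or has_low_confidence
            or word.rejected_in_character_session)


def word_session_queue(words: Iterable[Word], dictionary: Set[str]) -> List[Word]:
    """Select the words to verify, in lexicographic order of the OCR text."""
    selected = [w for w in words if needs_word_session(w, dictionary)]
    return sorted(selected, key=lambda w: w.text)


if __name__ == "__main__":
    dictionary = {"the", "cat", "dog"}
    words = [
        Word("the", (0.99, 0.99, 0.99), False),
        Word("tbe", (0.99, 0.99, 0.99), False),
        Word("dog", (0.99, 0.99, 0.99), True),
        Word("cat", (0.99, 0.50, 0.99), False),
    ]
    print([w.text for w in word_session_queue(words, dictionary)])
```

Sorting by the recognized text places repeated and near-identical words next to each other, so the same decision can be applied to neighbours quickly.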
Word Session – Before data entry
Word Session – After data entry
Word Session – Before data entry
Word Session – After data entry
Page Session: Used for correction of cases where word segmentation fails. Can be activated in one of 4 flavors: word view, line view, paragraph view, tagging view. The system can move automatically from one problematic word to the next.
CONCERT - Page Session
Multilingual Support - English, 1772
Multilingual Support - French, 1668
Multilingual Support - German Gothic, 1778
Multilingual Support - Dutch, 1789
Multilingual Support - Japanese
Hearst Newsreel Collection – Index Card
User Monitoring: Wide public participation may end up with data corruption by malicious users or non-qualified users. User rating and feedback motivate the use of the system. Three-way validation: Good injection: characters/words known with high confidence to be true. Similar injection: characters/words which may look similar but are not identical (for example, an ‘O’ injected into a ‘Q’ session). Error injection: characters/words known with high confidence to be false.
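One way to turn the three injection types into a user rating is sketched below. The scoring rule (fraction of control items answered as expected) is an assumption and is not described in the presentation; `ControlAnswer` and `user_reliability` are hypothetical names. The expected behaviour is that a user accepts "good" injections and rejects "similar" and "error" injections.

```python
from typing import Iterable, NamedTuple

INJECTION_KINDS = ("good", "similar", "error")


class ControlAnswer(NamedTuple):
    kind: str            # one of INJECTION_KINDS
    user_accepted: bool  # True if the user confirmed the item as correct


def expected_accept(kind: str) -> bool:
    """Only a 'good' injection should be accepted by a careful user."""
    if kind not in INJECTION_KINDS:
        raise ValueError(f"unknown injection kind: {kind!r}")
    return kind == "good"


def user_reliability(answers: Iterable[ControlAnswer]) -> float:
    """Fraction of injected control items the user handled as expected.

    Assumed scoring rule for illustration; the source does not define one.
    """
    answers = list(answers)
    if not answers:
        raise ValueError("no control answers to score")
    correct = sum(1 for a in answers
                  if a.user_accepted == expected_accept(a.kind))
    return correct / len(answers)


if __name__ == "__main__":
    answers = [
        ControlAnswer("good", True),     # accepted a true item: correct
        ControlAnswer("similar", True),  # accepted an 'O' in a 'Q' session: wrong
        ControlAnswer("error", False),   # rejected a false item: correct
        ControlAnswer("good", True),     # correct
    ]
    print(user_reliability(answers))
```

A low score could flag either a malicious or an unqualified user; telling the two apart would need more signal than this sketch uses.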
User Monitoring – Screenshots
User Monitoring – Screenshots (cont.)
User Monitoring – Screenshots (cont.)
User Monitoring – Screenshots (cont.)
User Monitoring – Screenshots (cont.)
User Monitoring – Screenshots (cont.)
User Monitoring – Screenshots (cont.)
CONCERT Games
CONCERT in use: Hearst Newsreel Archive Collection: first production use; tagging capabilities. Pilot in Japan for the Japanese Library: including customization for Japanese. 1st-phase pilots in major libraries in Europe: KB (National Library of the Netherlands), BL (British Library), BSB (Bavarian State Library).
CONCERT Future Planning: Search over OCR: beyond transcription. Improve user feedback: online advisor; best-performers list. Community building around content: integrate community tools within the platform. CONCERT Games: iPhone/iPad/Android/desktop. E-book creation: fully digital transcription; using the original font as an option. Page distortion correction: fully integrate the word-based page distortion correction.
Thank You!

IMPACT Final Conference - Asaf Tzadok


Editor's Notes

  • #31 Page session – see the entire page – System breaks out words, lines, letters (what you want) – color coding shows initial confidence – blue is high, red is low, green is corrected – move from