IMPACT Final Conference - Claus Gravenhorst

Applied IMPACT: Does the new FineReader Engine and Dutch Lexicon increase OCR accuracy and production efficiency? A case study by KB and CCS.

Published in: Education, Technology

Transcript of "IMPACT Final Conference - Claus Gravenhorst"

  1. 1. Content Conversion Specialists GmbH. Applied IMPACT: Does the new FineReader Engine and Dutch lexicon increase OCR accuracy and production efficiency? A case study by KB and CCS. Claus Gravenhorst, Director Strategic Initiatives. Final IMPACT Conference, London, 2011-10-24
  2. 2. Agenda
      - Scope
      - Improving Text Accuracy
      - Test Material
      - Test System
      - Provided IMPACT Tools
      - Integration with Test System
      - Test Scenario
      - Evaluation Method
      - Evaluation Results
      - Conclusion
  3. 3. Scope
      - Testing results and tools from the IMPACT project in a real mass production environment
      - IMPACT provides improved OCR technology as well as historical dictionaries for 9 European languages
      - Motivation of CCS:
          - Benefit from technology improvements
          - Increase the level of automation for small, mid and large scale digitisation workflows
          - Avoid reinventing the wheel in specific areas such as OCR and language technology
  4. 4. Improving Text Accuracy
      - Various pre- and post-processing steps, as well as extensions of the OCR, have an impact on text accuracy
      - [Diagram: processing stages from image to corrected text, each with the methods that influence accuracy]
          - Image pre-processing: deskew, cropping, dewarping, etc.
          - Segmentation / layout analysis: zoning, classification, ordering, grouping
          - OCR: dictionary, pattern training
          - Text correction: dictionary, linguistic methods, crowd sourcing
      - Information retrieval benefits: cleaner index, more relevant hits
  5. 5. Test Material
      - 17th century Dutch newspaper “Courante uyt Italien, Duytslandt, &c.”, printed with Fraktur/Gothic fonts
      - Databank of Digital Daily Newspapers (DDD)
          - 1619 - 1635, 73 issues, 144 pages
          - TIFF, 24 bit color, 300 dpi, captured with Canon DSLR camera, saved with Adobe Photoshop CS4
      - IMPACT Ground Truth Material (GTM)
          - 1620 - 1632, 33 issues, 72 pages, overlap with DDD pages
          - TIFF, 24 bit color, 300 dpi, captured with Canon DSLR camera, saved with ImageMagick 6.5.7
          - page.xml with segment/zone coordinates and keyed text
  8. 9. Test System
      - docWorks: Large Scale Digitisation Workflow
      - Provides layout analysis for page segmentation and zone classification
      - Developed during the EU funded FP5 research project METAe (2000 – 2003)
      - Used for small, mid and large scale digitisation projects by cultural heritage institutions and service providers around the world (e.g. BL books/newspapers, KB DDD newspapers, Proquest/EEB, etc.)
      - Provides structural analysis for recognition of logical entities
  9. 10. Test System
      [Diagram: docWorks production workflow. Labels on the slide:]
      - Input devices: Robot-Scanner, Book-Scanner, Document-Scanner, Microfilm-Scanner
      - Processing stages: Scanning, Imaging, Conversion, Layout Analysis, OCR, ISR, each with QA+Correction; Reject Condition leading to Re-Scan
      - Quality assurance: Automated QA; Manual QA (in-house, near-shore, off-shore, multiple locations); Manual QS (in-house, near-shore); Delivery QA (random)
      - Tracking and storage: Check in / Check out, Document UID, Barcode, Item Tracking, Image Metadata Database, Repository, Metadata (Z 39.50)
      - Output: Final Output, Book Delivery
  10. 11. Test System (continued)
      - docWorks processing chain as shown on the slide: Conversion, Image Pre-processing, Layout Analysis, OCR (ABBYY), Structural Analysis (ISR)
  12. 13. Provided IMPACT Tools
      - ABBYY FineReader Engine 10 with Gothic/Fraktur extension and standard interface for integration of external dictionaries
      - Corpus based dictionary of DBNL, the Digitale Bibliotheek voor de Nederlandse Letteren ( www.dbnl.org ), 16th – 19th century
      - Dictionary based dictionary from the WNT, the Woordenboek der Nederlandsche Taal, 16th – 19th century
      - DLL incl. documentation for:
          - access to the dictionaries and integration into the OCR process via the FRE10 external dictionary interface
          - access to routines for fixing misrecognition of “long S” characters
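The slides do not describe how the DLL's “long S” routines work. As an illustration only, one common approach is sketched below: it assumes the long s (ſ) is typically misread as "f" and that a lexicon lookup can decide which "f" characters should really be "s". The function names and the tiny lexicon are hypothetical and are not part of the IMPACT DLL or the ABBYY API.

```python
from itertools import product
from typing import Iterator, Set


def long_s_candidates(word: str) -> Iterator[str]:
    """Yield every variant of `word` in which each 'f' is kept or replaced by 's'."""
    positions = [i for i, ch in enumerate(word) if ch == "f"]
    for choice in product("fs", repeat=len(positions)):
        chars = list(word)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        yield "".join(chars)


def fix_long_s(word: str, lexicon: Set[str]) -> str:
    """Return a lexicon-supported spelling of `word`, or the word itself.

    Assumption: `lexicon` holds lower-case historical word forms.
    """
    # A correctly recognised long s only needs normalising to a round s.
    normalised = word.replace("ſ", "s")
    if normalised.lower() in lexicon:
        return normalised
    # Otherwise try f -> s substitutions and accept the first lexicon hit.
    for candidate in long_s_candidates(normalised):
        if candidate.lower() in lexicon:
            return candidate
    return normalised  # no supported variant found; leave the word unchanged


# Hypothetical one-word lexicon, using the example word from the slides.
print(fix_long_s("Duytflant", {"duytslant"}))
```

The candidate set grows exponentially with the number of "f" characters in a word, which is acceptable for single words but would need pruning for long tokens.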
  13. 14. Integration with Test System
      - docWorks test system built
      - DLL integrated with docWorks code
      - External dictionaries callable via DLL and ABBYY external dictionary interface
      - Minor FRE10 bug identified during integration phase; ABBYY support immediately provided a workaround
      - Overall, the integration went smoothly
  14. 15. Test Scenario
      - GTM sample processing:
          - based on segmentation obtained from the page.xml files
          - without any image pre-processing
      - DDD processing:
          - comparison with GTM sample processing showed slightly better results
          - image pre-processing and segmentation by docWorks
          - text correction for a complete run to create DDD Ground Truth Text
      - 4 runs with DDD images:
          - FR engine (FRE) 9 and standard Dutch dictionary
          - FRE 10 and standard Dutch dictionary
          - FRE 10 and corpus based dictionary incl. “long S” fix
          - FRE 10 and dictionary based dictionary incl. “long S” fix
  15. 16. Evaluation Method
      - Goal was to generate statistical data for character and word accuracy of all 4 test runs through automated comparison of the text output with the DDD Ground Truth Text
      - Accuracy rates are computed with the Levenshtein algorithm. The Levenshtein distance is the minimum number of edit operations (insert, delete, substitute) needed to transform one text into another
      - The Levenshtein percentage is the Levenshtein distance multiplied by 100 and divided by the number of characters or words in the correct DDD Ground Truth Text
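The metric defined on this slide can be sketched in a few lines of Python. This is a minimal illustration of the stated definition, not the evaluation tool used in the study; the function names are hypothetical, splitting words on whitespace is an assumption, and the OCR string in the usage line is an invented example.

```python
from typing import Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Minimum number of inserts, deletes and substitutions that turn a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def levenshtein_percentage(ocr_units: Sequence, truth_units: Sequence) -> float:
    """Distance * 100 / number of units in the ground truth (smaller is better)."""
    if not truth_units:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ocr_units, truth_units) * 100 / len(truth_units)


def character_percentage(ocr_text: str, truth_text: str) -> float:
    return levenshtein_percentage(ocr_text, truth_text)


def word_percentage(ocr_text: str, truth_text: str) -> float:
    # Assumption: words are separated by whitespace.
    return levenshtein_percentage(ocr_text.split(), truth_text.split())


truth = "Courante uyt Italien"
ocr = "Courante uyt Jtalien"  # invented OCR output with one wrong character
print(character_percentage(ocr, truth), word_percentage(ocr, truth))
```

Because the same routine works on any sequence, the character and word figures differ only in how the text is split. A single wrong character counts as one error among all characters but spoils a whole word, which is why word percentages come out much higher than character percentages.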
  16. 17. Evaluation Results (1)
      [Screenshot: docWorks text correction mode, showing long S recognition, e.g. in “Duytslant”]
  17. 18. Evaluation Results (2)
      [Bar chart: Levenshtein percentage per test run, axis from 0 to 70 %]

      Run         Characters (%)   Words (%)
      FRE9+SD     24,33            66,38
      FRE10+SD    23,18            64,87
      FRE10+CD    18,85            56,54
      FRE10+DD    17,64            52,70

      SD = Standard Dutch Dictionary, CD = Corpus based Dictionary, DD = Dictionary based Dictionary

      - Smaller value represents a higher text accuracy
      - Improvement in character accuracy is 27,5 % (FRE10+DD vs. FRE9+SD)
      - Improvement in word accuracy is 20,6 % (FRE10+DD vs. FRE9+SD)
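The two improvement figures on this slide are consistent with a relative reduction of the Levenshtein percentage between the FRE9+SD and FRE10+DD runs (24,33 to 17,64 for characters, 66,38 to 52,70 for words). The short check below reproduces them; the function name is hypothetical, and decimal points replace the decimal commas used on the slide.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative reduction of an error percentage, expressed in percent."""
    return (baseline - improved) / baseline * 100


# Levenshtein percentages from the slide: (FRE9+SD, FRE10+DD)
characters = (24.33, 17.64)
words = (66.38, 52.70)

print(round(relative_improvement(*characters), 1))  # character level
print(round(relative_improvement(*words), 1))       # word level
```

Both values are relative improvements, not percentage-point differences: the character-level error drops by 6,69 points and the word-level error by 13,68 points.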
  18. 19. Conclusion
      - Improved ABBYY OCR and historical dictionaries enable higher text accuracy and lower the effort for text correction
      - Tools are easy to integrate via DLL and the ABBYY interface for external dictionaries
      - Future digitisation projects will benefit from historical dictionaries
      - Biggest potential for further improvement is in language technology
  19. 20. Thank you
      Claus Gravenhorst, Director Strategic Initiatives
      CCS Content Conversion Specialists GmbH (information : accessible)
      Weidestr. 134, D-22083 Hamburg, Germany
      Phone: +49 (0) 402 2713016
      Fax: +49 (0) 402 2713011
      Mobile: +49 (0) 176 12713016
      E-mail: c.gravenhorst@content-conversion.com
      Internet: www.content-conversion.com