IMPACT Final Conference - Michael Fuchs

ABBYY FineReader: IMPACT Improvements with Michael Fuchs from ABBYY Europe

Published in: Education, Technology, Business

  1. ABBYY & OCR Improvements for IMPACT
     Michael Fuchs, Senior Product Marketing Manager, ABBYY Europe, fuchs@abbyy.com
     IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
  2. Agenda
     • Who is ABBYY?
     • Company Overview
     • (Short) Product Overview
     • ABBYY Technology in the IMPACT project
     • OCR & Processing: IMPACT improvements
     • Binarisation, Segmentation, Recognition
     • Dictionary API, Export Formats
     • Lessons Learned, Pricing, Pre-Announcement, Q&A
  3. ABBYY & IMPACT
     ABBYY & OCR for IMPACT
  4. ABBYY Group Overview
     • Founded in 1989 as BIT Software
     • More than 1,000 employees in 14 offices worldwide
     • Headquarters and R&D in Moscow, Russia
  5. ABBYY OCR Products: Usage View (OCR & Document Conversion)
     • Desktop/Workgroup: user-driven processing, ready to use
       • Products: FineReader (Professional, Corporate, Site Licence Edition), PDF Transformer, FotoReader, ScreenshotReader
       • Note: No Gothic/Fraktur OCR!
       • Users are: end users, companies
     • Server/Backend: automated processing, ready to use
       • Products: Recognition Server (Professional, Extended Edition)
       • Gothic/Fraktur OCR & XML export support!
       • Users are: companies, scan service providers, libraries
     • SDK/Integration: automated processing, development needed
       • Products: FineReader Engines (Windows, Linux, Mac OS X, FreeBSD, embedded systems), Mobile OCR Engine (Android, Symbian, Linux, Windows, Windows Mobile, iOS)
       • Users are: developers, scan service providers (libraries), IMPACT research
  6. What (ABBYY) OCR can read...
     Recognition Languages
     • Almost 200 OCR languages
     • 34 languages with dictionary support and spell check
     • Alphabets: Cyrillic, Latin, Greek, Armenian, Hebrew, Thai
     • Chinese, Japanese, Korean (CJK): 4 character sets (Chinese traditional and simplified, Japanese, Korean)
     • Arabic (Technical Preview in the SDK)
     Font Types
     • Recognition of mixed font types (dot-matrix printer, typewriter, Gothic, etc.)
     • OCR-A
     • OCR-B
     • MICR (E13B)
     • CMC-7
  7. IMPACT & ABBYY
     • ABBYY is the OCR technology provider for IMPACT members
     • ABBYY also improved its core technologies for the recognition of old documents within IMPACT. Focus areas are/were:
       • Image pre-processing
       • Segmentation
       • Character recognition
       • Export
     • IMPACT members work with the FineReader Engine Software Development Kit (SDK), not the desktop application
     • The IMPACT focus is/was on research, not on setting up a production system ;o)
     • Improved technologies are/will be added to current/future products
  8. Designed not to be OCRed
  9. Why ABBYY? - OCR ...
     [Images: original image (perfect quality :o) ), standard OCR*, ABBYY Fraktur OCR*]
     * Recognition Server 3.0 R1, with Gothic/Fraktur disabled and enabled
  10. ABBYY "History" and Old Fonts Recognition
      • 2003: FineReader XIX (V7 technology), the result of METAe (2000-2003)
      • 2008: FineReader Engine 9.0 (Release 1), the pre-IMPACT "state of the art"
      • 2010: FineReader Engine 10, IMPACT project optimizations
  11. ABBYY and Old European Fonts
      • Accuracy comparison (chart for 2003, 2008 and 2010): up to 98.2% on good-quality images
      • ABBYY technology version 10, recognition of old European fonts:
        • 25% more accurate than FRE 9.0
        • 38% more accurate than FR XIX
  12. OCR Processing Steps & ABBYY Improvements for IMPACT
  13. Processing Steps
      • Step 1. Scanning, Image Loading, Pre-Processing and Modification: compensating for image defects and making the document suitable for automatic OCR
      • Step 2. Document Layout Analysis: layout analysis, detection of document sections such as text, images and barcodes
      • Step 3. (Optical) Character Recognition: automatic recognition of characters, applying the selected recognition languages and dictionaries
      • Step 4. (Optional) Verification, by operators or automated post-correction: manual validation of suspicious characters and words
      • Step 5. Document Synthesis and Export: generating an output document in the selected format
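The five steps can be read as a chain of stages that each enrich a shared document record. The following is only a minimal sketch of that idea: every function name, the dictionary keys and the stand-in logic are hypothetical and are not taken from the FineReader Engine API. No real OCR is performed; the recognised text is supplied by the caller.

```python
from typing import Callable, Dict, List

Doc = Dict[str, object]
Stage = Callable[[Doc], Doc]


def preprocess(doc: Doc) -> Doc:
    # Step 1 stand-in: a fixed threshold (1 = ink, 0 = background).
    doc["binary"] = [[1 if p < 128 else 0 for p in row] for row in doc["image"]]
    return doc


def analyse_layout(doc: Doc) -> Doc:
    # Step 2 stand-in: pretend the page is a single text block.
    doc["blocks"] = [{"type": "text"}]
    return doc


def recognise(doc: Doc) -> Doc:
    # Step 3 stand-in: no OCR here, the text comes from the caller.
    doc["text"] = doc.get("stub_text", "")
    return doc


def verify(doc: Doc) -> Doc:
    # Step 4 (optional): flag words that an operator should check.
    doc["suspicious"] = [w for w in str(doc["text"]).split() if "?" in w]
    return doc


def export(doc: Doc) -> Doc:
    # Step 5 stand-in: plain-text export.
    doc["output"] = str(doc["text"]) + "\n"
    return doc


def run_pipeline(doc: Doc, with_verification: bool = True) -> Doc:
    """Run the stages in order; verification can be skipped."""
    stages: List[Stage] = [preprocess, analyse_layout, recognise]
    if with_verification:
        stages.append(verify)
    stages.append(export)
    for stage in stages:
        doc = stage(doc)
    return doc


result = run_pipeline({"image": [[0, 255], [200, 10]], "stub_text": "Frakt?r text"})
print(result["suspicious"])
```

Keeping verification as an optional stage mirrors the "(optional)" marking of Step 4 on the slide.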
  14. Step 1: Image pre-processing
  15. Step 1: Image pre-processing
      Image Loading, Pre-Processing and Modification
      • Intelligent background filtering
      • Adaptive binarisation
      • General binarisation at the image level cannot deliver good results for OCR
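To illustrate why a single image-level threshold struggles, the sketch below compares it with a simple local-mean threshold on a tiny page whose background darkens from left to right. This is a generic textbook technique chosen for illustration, not ABBYY's binarisation algorithm; the function names, the window radius and the offset are all assumptions.

```python
from typing import List

Image = List[List[int]]  # grayscale, 0 (black) .. 255 (white)


def global_binarise(img: Image, threshold: int = 128) -> Image:
    """One threshold for the whole image (1 = ink, 0 = background)."""
    return [[1 if p < threshold else 0 for p in row] for row in img]


def adaptive_binarise(img: Image, radius: int = 1, offset: int = 15) -> Image:
    """Mark a pixel as ink if it is clearly darker than its local mean."""
    height, width = len(img), len(img[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            rows = range(max(0, y - radius), min(height, y + radius + 1))
            cols = range(max(0, x - radius), min(width, x + radius + 1))
            window = [img[j][i] for j in rows for i in cols]
            local_mean = sum(window) / len(window)
            out[y][x] = 1 if img[y][x] < local_mean - offset else 0
    return out


# Background fades from 220 to 100; two "text" pixels are 80 levels darker.
page = [
    [220, 200, 180, 160, 140, 120, 100],
    [220, 120, 180, 160, 140, 40, 100],
    [220, 200, 180, 160, 140, 120, 100],
]
for row in global_binarise(page):
    print(row)
for row in adaptive_binarise(page):
    print(row)
```

On this toy page the global threshold turns the dark right-hand background into ink, while the local threshold keeps only the two text pixels.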
  16. Step 1: Image pre-processing
      New in V10: binarisation and textured-background optimisations
      [Images: original scan, V9 binarisation, new V10 binarisation]
  17. Step 1: Image pre-processing
      New in V10: binarisation and textured-background optimisations
      [Images: original scan, V9 binarisation, V10 binarisation]
  18. Step 1: Image pre-processing
      New in V10: binarisation for the IMPACT project
      • Original
      • State of the art (V9)
      • New (V10)
      • No text from the other page!
  19. Step 2: Document Layout Analysis
  20. Step 2: Document Layout Analysis
      Analyze the layout and find text, images, tables and barcodes
  21. Step 2: Document Layout Analysis (old newspapers)
      Segmentation improvements: image/text detection (example 1/3)
      [Images: V9 technology, V10 technology] Part of the column was detected as an image
  22. Step 2: Document Layout Analysis (old newspapers)
      Segmentation improvements: word order detection (example 2/3)
      [Images: V9 technology, V10 technology] Fewer linear word order errors
  23. Step 2: Document Layout Analysis (old newspapers)
      Segmentation improvements: lost text, i.e. no detection (example 3/3)
      [Images: V9 technology, V10 technology] Less lost text
  24. Step 2: Document Layout Analysis
      Segmentation improvements: IMPACT results over time
      Before IMPACT:
      • Overall segmentation improvements
        • Better picture detection
        • Better separators
        • Better page layout reconstruction
      • Only a random set of old newspapers available
      After IMPACT:
      • IMPACT segmentation ground truth available
      • New (internal) DA (document analysis) model for historic newspapers
      • New segmentation evaluation methodology
      • Evaluation results on newspapers
        • 40% fewer split/merge errors
        • 25% less garbage and lost text
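The slide does not describe the evaluation methodology itself. As a rough illustration of what "split", "merge" and "lost" can mean when detected regions are compared against ground truth, here is a minimal sketch using axis-aligned boxes. The definitions, names and the overlap rule are assumptions made for this example and are not the IMPACT or ABBYY methodology.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # left, top, right, bottom


def overlap_area(a: Box, b: Box) -> int:
    """Area of the intersection of two boxes (0 if they do not overlap)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, width) * max(0, height)


def count_errors(truth: List[Box], detected: List[Box]) -> Dict[str, int]:
    """Count split, merged and lost ground-truth regions."""
    # Split: one true region is covered by more than one detected region.
    split = sum(
        1 for t in truth
        if sum(overlap_area(t, d) > 0 for d in detected) > 1
    )
    # Merge: one detected region covers more than one true region.
    merge = sum(
        1 for d in detected
        if sum(overlap_area(t, d) > 0 for t in truth) > 1
    )
    # Lost: a true region that no detected region touches.
    lost = sum(
        1 for t in truth
        if all(overlap_area(t, d) == 0 for d in detected)
    )
    return {"split": split, "merge": merge, "lost": lost}


# Two columns and a footer in the ground truth.
truth = [(0, 0, 100, 300), (120, 0, 220, 300), (0, 320, 220, 400)]
# The detector merged the columns and split the footer in two.
detected = [(0, 0, 220, 300), (0, 320, 100, 400), (120, 320, 220, 400)]
print(count_errors(truth, detected))
```

A real evaluation would normally weight errors by area or text content; the counts here only make the terms concrete.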
  25. Step 3: Text/Character Recognition
  26. Step 3: Text/Character Recognition
      Samples of classifiers used in ABBYY technologies
      After line detection, character recognition is applied with different classifiers:
      • Raster classifier
      • Contour classifier
      • Structure classifier
      • Feature differentiating classifier
  27. Step 3: Text/Character Recognition
      Optimization and new developments: improved Gothic classifiers
      • A significant amount of time was invested in Gothic classifier training
      • The libraries' selection of ground-truth material (historical relevance) was used
      • New Gothic graphemes were added
      Results
      • Good-quality images: 2.8% (total) error rate on the test set used, about a 20% improvement over the "state of the art" (V9) and almost comparable to modern documents
      • Bad-quality images: 7% (total) error rate on the test set used, about a 30% improvement over the "state of the art" (V9)
      • Most of the improvements are available in current ABBYY products: ABBYY FineReader Engine 10 (SDK) and Recognition Server 3.0
      • Quality optimization will be continued in future releases and technology cycles
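The slide does not say how the error rate was measured. A common way to obtain a character error rate is the edit distance between ground truth and OCR output, divided by the length of the ground truth; the sketch below shows that, plus the arithmetic for backing out the implied V9 error rates. Both the metric and the reading of "improvement" as a relative reduction in error rate are assumptions, and all function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(truth: str, ocr: str) -> float:
    """Edit distance relative to the length of the ground truth."""
    return edit_distance(truth, ocr) / len(truth)


def implied_baseline(new_rate: float, relative_improvement: float) -> float:
    """Error rate before an improvement, if the improvement is relative."""
    return new_rate / (1 - relative_improvement)


print(character_error_rate("Fraktur", "Frahtur"))
print(round(implied_baseline(2.8, 0.20), 2))  # good-quality images
print(round(implied_baseline(7.0, 0.30), 2))  # bad-quality images
```

Under these assumptions the V9 error rates would have been about 3.5% on the good-quality set and about 10% on the bad-quality set.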
  28. Step 3: Text/Character Recognition
      Optimization and new developments: Old Slavonic as a new OCR language (new development)
      [Images: before, now]
  29. Quality Test Comparison: Binarisation & Recognition Improvements
  30. Binarisation & Recognition Improvements
      How to evaluate the recognition improvements of binarisation?
      • Binarisation and recognition quality go hand in hand!
      • V9 binarisation + V9 recognition: number of errors = 100% (baseline)
      • V9 binarisation + V10 recognition: 5% fewer errors
      • V10 binarisation + V9 recognition: 11% fewer errors
      • V10 binarisation + V10 recognition: 15% fewer errors
      [Chart labels: Binarisation, Recognition Technology]
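The four figures can be put into a small table to see how the two improvements combine. The sketch below uses only the percentages from the slide; the baseline of 1,000 errors is an invented number for illustration, and the "independent effects" comparison is an interpretation, not something the slide claims.

```python
# Relative number of errors, keyed by (binarisation, recognition) version.
relative_errors = {
    ("V9", "V9"): 1.00,    # baseline
    ("V9", "V10"): 0.95,   # new recognition only
    ("V10", "V9"): 0.89,   # new binarisation only
    ("V10", "V10"): 0.85,  # both
}

# If the two gains acted independently, the remaining errors would multiply.
predicted_both = relative_errors[("V9", "V10")] * relative_errors[("V10", "V9")]

# Hypothetical baseline, only to turn percentages into counts.
baseline_errors = 1000
for (binarisation, recognition), factor in relative_errors.items():
    errors = round(baseline_errors * factor)
    print(f"{binarisation} binarisation + {recognition} recognition: {errors} errors")

print(f"measured with both: {relative_errors[('V10', 'V10')]:.3f}")
print(f"predicted if independent: {predicted_both:.4f}")
```

The measured combined figure (0.85) is close to the product of the two single gains (about 0.846), and the binarisation change alone accounts for the larger share.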
  31. Steps 3-5: Dictionaries & Export
  32. Steps 3-5: Other Optimizations
      External Dictionary API tuning
      • The External Dictionary API was available in the FineReader Engine (SDK)
      • Support for any language, any time period
      • The API was/is heavily used by IMPACT language partners to run quality tests
      New ALTO XML export formats
      • FineReader Engine 10 R2, December 2010
      • Recognition Server 3.0, July 2011
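ALTO is an XML format that stores recognised words together with their coordinates and confidence values. As a minimal sketch of how such an export might be consumed, the code below reads text lines and low-confidence words from a small hand-written sample using Python's standard library. The sample is invented for illustration and is not real FineReader output; the function names and the confidence threshold are assumptions.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hand-written sample in the style of ALTO; not produced by any ABBYY product.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Die" HPOS="100" VPOS="200" WIDTH="60" HEIGHT="40" WC="0.98"/>
            <SP/>
            <String CONTENT="Zeitung" HPOS="180" VPOS="200" WIDTH="150" HEIGHT="40" WC="0.93"/>
          </TextLine>
          <TextLine>
            <String CONTENT="von" HPOS="100" VPOS="260" WIDTH="60" HEIGHT="40" WC="0.95"/>
            <SP/>
            <String CONTENT="1848" HPOS="180" VPOS="260" WIDTH="90" HEIGHT="40" WC="0.61"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix so any ALTO version can be read."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text: str) -> List[str]:
    """Return the text of each TextLine, words joined by single spaces."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [child.get("CONTENT", "") for child in element
                     if local_name(child.tag) == "String"]
            lines.append(" ".join(words))
    return lines


def low_confidence_words(xml_text: str, threshold: float = 0.8) -> List[str]:
    """Return words whose word confidence (WC) is below the threshold."""
    root = ET.fromstring(xml_text)
    return [element.get("CONTENT", "") for element in root.iter()
            if local_name(element.tag) == "String"
            and float(element.get("WC", "1")) < threshold]


print(alto_lines(SAMPLE_ALTO))
print(low_confidence_words(SAMPLE_ALTO))
```

Matching on local element names rather than a fixed namespace is a deliberate choice here, since different ALTO versions use different namespace URIs.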
  33. Additional Notes
  34. Further Information & Trial Versions
      The ABBYY Gothic/Fraktur OCR Portal: www.frakturschrift.com
  35. What IMPACT taught ABBYY about Libraries & Mass Digitization projects...
      The Reality
      • Masses of books/documents are available and already scanned
      • It is unclear whether Antiqua and/or Gothic/Fraktur fonts are used in the documents
      • Pre-sorting is impossible; it would cost too much time and money
      ABBYY Europe's Answer
      • Reduced the pricing for mixed "Old" + "Modern" font OCR projects
      • The pricing is now ready for "mass processing"
      Examples: Recognition Server 3.0 with "Gothic" enabled
      • 10,000 pages for 299 euros, available online
      • 500,000 pages* for 5,000 euros = 1 euro cent per page = ca. 2,000 books of 250 pages
      • Over 3 million pages*: ca. 0.52 euro cent per page = 12,000 books at 1.25 € each (250 pages)
      • Over 10 million pages*: ca. 40,000 books = ca. 0.5 € per book
      • ... No more excuses for not OCRing :o)
      * Page size is A4; bigger formats are counted as multiple pages
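The per-page and per-book figures follow from simple arithmetic, which the sketch below reproduces using only numbers from the slide and its assumption of 250 pages per book. The function names are made up for this example. The slide marks several values as "ca.", so small differences from the computed results are to be expected.

```python
def cents_per_page(total_euro: float, pages: int) -> float:
    """Price per page in euro cents."""
    return total_euro * 100 / pages


def books(pages: int, pages_per_book: int = 250) -> int:
    """Number of whole books that a page volume corresponds to."""
    return pages // pages_per_book


def euro_per_book(page_price_cents: float, pages_per_book: int = 250) -> float:
    """Price per book in euros, given a per-page price in cents."""
    return page_price_cents * pages_per_book / 100


print(cents_per_page(299, 10_000))     # online package
print(cents_per_page(5_000, 500_000))  # 500,000-page package
print(books(500_000), books(3_000_000), books(10_000_000))
print(round(euro_per_book(0.52), 2))   # book price at 0.52 cent per page
```

At exactly 0.52 cent per page a 250-page book comes to about 1.30 €, slightly above the rounded 1.25 € on the slide, and 0.5 € per book at the highest volume corresponds to 0.2 cent per page.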
  36. Pre-Announcement: ABBYY Online OCR Services with Gothic/Fraktur
      The ABBYY Gothic/Fraktur OCR Portal: finereader.abbyyonline.com
      • Historic OCR added just last week
      • Web GUI to upload documents and get results
      • Simple to use
      • Low volume, ad hoc usage
      • Instant results, quality evaluation
      • Pay as you go
      ABBYY Online OCR SDK
      • OCR service with API and XML output
      • Runs on Windows Azure
      • Currently in closed beta test
      • Public beta test in Q1/2012
  37. Summary
  38. The whole is greater than the sum of its parts (Aristotle)
  39. Thank you for your attention! Questions?
      Michael Fuchs, Senior Product Marketing Manager, ABBYY Europe, fuchs@abbyy.com
