IMPACT Final Conference - Michael Fuchs

ABBYY FineReader: IMPACT Improvements with Michael Fuchs from ABBYY Europe

Published in: Education, Technology, Business

Transcript

  • 1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
      ◦ ABBYY & OCR Improvements for IMPACT
      ◦ Michael Fuchs, Senior Product Marketing Manager, ABBYY Europe, fuchs@abbyy.com
  • 2. Agenda
      ◦ Who is ABBYY? Company Overview, (Short) Product Overview
      ◦ ABBYY Technology in the IMPACT project
      ◦ OCR & Processing, IMPACT improvements: Binarisation, Segmentation, Recognition; Dictionary API, Export Formats
      ◦ Lessons Learned, Pricing, Pre-Announcement, Q&A
  • 3. ABBYY & IMPACT (slide footer: ABBYY & OCR for IMPACT)
  • 4. ABBYY Group Overview
      ◦ Founded in 1989 as BIT Software
      ◦ > 1000 employees in 14 offices worldwide
      ◦ Headquarters/R&D in Moscow, Russia
  • 5. ABBYY OCR Products: Usage View (three columns, OCR & Document Conversion)
      ◦ Desktop/Workgroup: user driven processing, ready to use. Products: FineReader (Professional, Corporate, Site Licence Edition), PDF Transformer, FotoReader, ScreenshotReader. Note: No Gothic/Fraktur OCR! Users are: End Users, Companies (Libraries)
      ◦ Server/Backend: automated processing, ready to use. Products: Recognition Server (Professional, Extended Edition). Users are: Companies, Scan Service Provider, Libraries
      ◦ SDK/Integration: automated processing, development needed. Products: FineReader Engines (Windows, Linux, Mac OS X, FreeBSD, Embedded Systems), Mobile OCR Engine (Android, Symbian, Linux, Windows, Windows Mobile, iOS). Users are: Developers, Scan Service Provider, IMPACT Research
      ◦ Highlighted on the slide: Gothic/Fraktur OCR & XML Export Support!
  • 6. What (ABBYY) OCR can read...
      ◦ Recognition Languages: almost 200 OCR languages; 34 languages with dictionary support and spell check
      ◦ Alphabets: Cyrillic, Latin, Greek, Armenian, Hebrew, Thai
      ◦ Chinese, Japanese, Korean (CJK): 4 sets of hieroglyphs (Chinese (traditional and simplified), Japanese, Korean)
      ◦ Arabic (Technical Preview in the SDK)
      ◦ Font Types: recognition of mixed font types (dot-matrix printer, typewriter, Gothic, etc.); OCR-A; OCR-B; MICR (E13B); CMC-7
  • 7. IMPACT & ABBYY
      ◦ ABBYY is the OCR technology provider for IMPACT members
      ◦ ABBYY also improved the core technologies for the recognition of old documents in IMPACT; focus areas are/were: image pre-processing, segmentation, character recognition, export
      ◦ IMPACT members work with the Software Development Kit (SDK) FineReader Engine, not the desktop application
      ◦ IMPACT focus is/was on research and not on setting up a production system ;o)
      ◦ Improved technologies are/will be added to current/future products
  • 8. Designed not to be OCRed
  • 9. Why ABBYY? - OCR …
      ◦ Image panels: Original Image [perfect quality :o)], Std. OCR*, ABBYY Fraktur OCR*
      ◦ * Recognition Server 3.0 R1, Gothic/Fraktur disabled and enabled
  • 10. ABBYY “History” and Old Fonts Recognition
      ◦ 2003: FineReader XIX (V7 Technology), METAe result 2000-2003
      ◦ 2008: FineReader Engine 9.0 (Release 1), Pre-IMPACT “State of the Art”
      ◦ 2010: FineReader Engine 10, IMPACT Project Optimizations
  • 11. ABBYY and Old European Fonts
      ◦ Accuracy comparison (2003, 2008, 2010): up to 98.2% on good quality images
      ◦ ABBYY Technology Version 10 recognition of old European fonts: 25% more accurate than FRE 9.0, 38% more accurate than FR XIX
  • 12. OCR Processing Steps & ABBYY Improvements for IMPACT
  • 13. Processing Steps
      ◦ Step 1. Scanning, Image Loading, Pre-Processing and Modification: compensating image defects and making the document suited for automatic OCR
      ◦ Step 2. Document Layout Analysis: layout analysis, detection of document sections like text, images and barcodes
      ◦ Step 3. (Optical) Character Recognition: automatic recognition of characters, applying the selected recognition languages & dictionaries
      ◦ Step 4. (optional) Verification, by operators or automated post-correction: manual validation of suspicious characters and words
      ◦ Step 5. Document Synthesis and Export: generating an output document in the selected format
  • 14. Step 1: Image pre-processing
  • 15. Step 1: Image pre-processing. Image Loading, Pre-Processing and Modification
      ◦ Intelligent background filtering
      ◦ Adaptive binarisation
      ◦ General binarisation on an image level cannot deliver good results for OCR
  • 16. Step 1: Image pre-processing. New V10: Binarisation, Textured Background optimisations
      ◦ Image panels: original scan, V9 binarisation, new V10 binarisation
  • 17. Step 1: Image pre-processing. New V10: Binarisation, Textured Background optimisations
      ◦ Image panels: original scan, V9 binarisation, V10 binarisation
  • 18. Step 1: Image pre-processing. New V10: Binarisation for the IMPACT project
      ◦ Image panels: Original, State of the Art (V9), New (V10)
      ◦ No text from the other page!
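
The point from slide 15, that one threshold for the whole image cannot deliver good results, can be illustrated with a minimal sketch. This is not ABBYY's binarisation: the local-mean rule, the function names, the `radius` and `offset` parameters and the grey-level patch are all assumptions chosen for illustration.

```python
def global_threshold(image, threshold=128):
    """Binarise with one threshold for the whole image (1 = ink, 0 = background)."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in image]


def adaptive_threshold(image, radius=1, offset=10):
    """Binarise each pixel against the mean grey level of its local neighbourhood."""
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        out_row = []
        for x in range(width):
            # Collect the neighbourhood, clipped at the image borders.
            window = [
                image[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            # A pixel counts as ink only if it is clearly darker than its surroundings.
            out_row.append(1 if image[y][x] < local_mean - offset else 0)
        result.append(out_row)
    return result


# Hypothetical grey-level patch (0 = black, 255 = white): the background darkens
# from left to right, with one faint stroke (150) on the bright side and one
# dark stroke (40) on the stained side.
ROW = [220, 150, 220, 200, 180, 160, 140, 120, 110, 40, 110]
page = [ROW[:] for _ in range(3)]

global_result = global_threshold(page)
adaptive_result = adaptive_threshold(page)
```

On this patch the global threshold loses the faint stroke and turns the whole stained area into ink, while the local rule keeps exactly the two strokes. Any single threshold high enough to catch the faint stroke (above 150) would also mark the darker background (110 to 120) as ink.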
  • 19. Step 2: Document Layout Analysis
  • 20. Step 2: Document Layout Analysis. Analyze layout and find text, images, tables and barcodes
  • 21. Step 2: Document Layout Analysis (old Newspapers). Segmentation Improvements: Image/Text detection, Example 1/3
      ◦ Image panels: V9 Technology, V10 Technology
      ◦ Part of the column was detected as an image
  • 22. Step 2: Document Layout Analysis (old Newspapers). Segmentation Improvements: Word Order Detection, Example 2/3
      ◦ Image panels: V9 Technology, V10 Technology
      ◦ Fewer linear word order errors
  • 23. Step 2: Document Layout Analysis (old Newspapers). Segmentation Improvements: Lost text (no Detection), Example 3/3
      ◦ Image panels: V9 Technology, V10 Technology
      ◦ Less lost text
  • 24. Step 2: Document Layout Analysis. Segmentation Improvements: IMPACT Results over time
      ◦ Before IMPACT: overall segmentation improvements (better picture detection, better separators, better page layout reconstruction); only a random set of old newspapers available
      ◦ After IMPACT: IMPACT Segmentation Ground Truth available; new (internal) DA model for historic newspapers; new segmentation evaluation methodology
      ◦ Evaluation results on newspapers: 40% fewer split/merge errors; 25% less garbage and lost text
  • 25. Step 3: Text/Character Recognition
  • 26. Step 3: Text/Character Recognition. Samples for Classifiers used in ABBYY technologies
      ◦ After line detection, character recognition is applied with different classifiers
      ◦ Raster classifier, Contour classifier, Structure classifier, Feature differentiating classifier
  • 27. Step 3: Text/Character Recognition. Optimization and new Developments
      ◦ Improved Gothic classifiers: a significant amount of time was invested in Gothic classifier training; the library selection of ground truth material (historical relevance) was used; new Gothic graphemes were added
      ◦ Results, good quality images: 2.8% (total) error rate on the test set used, about a 20% improvement over the “state of the art” (V9), almost comparable to modern documents
      ◦ Results, bad quality images: 7% (total) error rate on the test set used, about a 30% improvement over the “state of the art” (V9)
      ◦ Most of the improvements are available in current ABBYY products: ABBYY FineReader Engine 10 (SDK) & Recognition Server 3.0
      ◦ Quality optimization will be continued in future releases and technology cycles
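
Slide 27 gives the V10 error rates and the improvement over V9, but not the V9 error rates themselves. A small sketch of how they could be back-calculated, assuming that "improvement" means a relative reduction of the error rate (the slide does not say so explicitly); the function name is made up for this example.

```python
def implied_baseline_error(new_error_pct, relative_improvement_pct):
    """Error rate of the older engine implied by the newer rate and a relative reduction.

    Assumes new = old * (1 - improvement), which is one reading of the slide.
    """
    return new_error_pct / (1 - relative_improvement_pct / 100)


good_quality_v9 = implied_baseline_error(2.8, 20)  # good quality images
bad_quality_v9 = implied_baseline_error(7.0, 30)   # bad quality images

print(f"Implied V9 error rate, good quality images: {good_quality_v9:.1f}%")
print(f"Implied V9 error rate, bad quality images: {bad_quality_v9:.1f}%")
```

Under that assumption the V9 error rates on the same test set would have been about 3.5% on good quality images and about 10% on bad quality images.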
  • 28. Step 3: Text/Character Recognition. Optimization and new Developments
      ◦ Old Slavonic as new OCR Language (New Development)
      ◦ Image panels: Before, Now
  • 29. Quality-Test-Comparison: Binarisation & Recognition Improvements
  • 30. Binarisation & Recognition Improvements. How to evaluate the recognition improvements of binarisation?
      ◦ Binarisation & recognition quality go hand in hand!
      ◦ # Errors = 100% with V9 binarisation & V9 recognition
      ◦ # Errors = -5% with V9 binarisation & V10 recognition
      ◦ # Errors = -11% with V10 binarisation & V9 recognition
      ◦ # Errors = -15% with V10 binarisation & V10 recognition
      ◦ Chart axes: Binarisation, Recognition Technology
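
The four figures on slide 30 can be compared with two simple estimates of the combined effect. This is only a worked reading of the slide, assuming that the V9/V9 run is the 100% baseline and the other values are relative reductions in the number of errors; the variable names are made up here.

```python
# Relative number of errors per (binarisation, recognition) combination,
# with V9 binarisation + V9 recognition as the 100% baseline.
relative_errors = {
    ("V9", "V9"): 1.00,
    ("V9", "V10"): 0.95,   # -5%: only recognition upgraded
    ("V10", "V9"): 0.89,   # -11%: only binarisation upgraded
    ("V10", "V10"): 0.85,  # -15%: both upgraded
}

recognition_gain = 1 - relative_errors[("V9", "V10")]
binarisation_gain = 1 - relative_errors[("V10", "V9")]
measured_combined_gain = 1 - relative_errors[("V10", "V10")]

# Estimate 1: the two gains simply add up.
additive_estimate = recognition_gain + binarisation_gain

# Estimate 2: each upgrade removes its share of the errors independently.
independent_estimate = 1 - relative_errors[("V9", "V10")] * relative_errors[("V10", "V9")]

print(f"measured:    {measured_combined_gain:.2%}")
print(f"additive:    {additive_estimate:.2%}")
print(f"independent: {independent_estimate:.2%}")
```

The measured 15% is slightly below both estimates (16% additive, about 15.5% independent). One possible explanation is that the two upgrades fix partly the same errors, which would fit the slide's remark that binarisation and recognition quality go hand in hand.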
  • 31. Step 3-5: Dictionaries & Export
  • 32. Step 3-5: Other Optimizations
      ◦ External Dictionary API tuning: the External Dictionary API was available in the FineReader Engine (SDK); support for any language, any time period; the API was/is heavily used by IMPACT language partners to run quality tests
      ◦ New ALTO XML export formats: FineReader Engine 10 R2, December 2010; Recognition Server 3.0, July 2011
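
To show what ALTO XML output can be used for downstream, here is a minimal sketch that reads text lines and low-confidence words with Python's standard library. The sample document is hand-written for this example and is not FineReader output; the ALTO v2 namespace and the 0.7 confidence threshold are assumptions, and an actual export may use a different schema version.

```python
import xml.etree.ElementTree as ET

# Assumption: ALTO schema version 2 namespace; check the exported files.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

# Hand-written, hypothetical sample with one text line of two words.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" PHYSICAL_IMG_NR="1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Historische" WC="0.97"/>
            <SP/>
            <String CONTENT="Zeitung" WC="0.58"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def read_lines(alto_xml):
    """Return the recognised text, one string per ALTO TextLine."""
    root = ET.fromstring(alto_xml)
    lines = []
    for text_line in root.iterfind(".//alto:TextLine", ALTO_NS):
        words = [s.get("CONTENT", "") for s in text_line.iterfind("alto:String", ALTO_NS)]
        lines.append(" ".join(words))
    return lines


def low_confidence_words(alto_xml, threshold=0.7):
    """Return words whose word confidence (WC, 0.0 to 1.0) is below the threshold."""
    root = ET.fromstring(alto_xml)
    return [
        s.get("CONTENT", "")
        for s in root.iterfind(".//alto:String", ALTO_NS)
        if float(s.get("WC", "1.0")) < threshold
    ]


lines = read_lines(SAMPLE_ALTO)
suspicious = low_confidence_words(SAMPLE_ALTO)
```

A list like `suspicious` is the kind of input the optional verification step (Step 4) would work on: only words below the chosen confidence are shown to an operator.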
  • 33. Additional Notes
  • 34. Further Information & Trial Versions. The ABBYY Gothic/Fraktur OCR Portal: www.frakturschrift.com
  • 35. What IMPACT taught ABBYY about Libraries & Mass Digitization projects…
      ◦ The Reality: masses of books/documents are available & already scanned; it is unclear if Antiqua and/or Gothic/Fraktur fonts are used in the documents; pre-sorting is impossible, it would be too expensive in time and cost
      ◦ ABBYY Europe's Answer: reduced the pricing for mixed “Old” + “Modern” font OCR projects; the pricing is now ready for “mass processing”
      ◦ Examples, Recognition Server 3.0 with “Gothic” enabled:
      ◦ 10,000 pages: 299 Euro, available online
      ◦ 500,000 pages*: 5,000 Euro = 1 Euro cent per page = ca. 2,000 books of 250 pages
      ◦ Over 3 million pages*: ca. 0.52 Euro cent per page = 12,000 books at 1.25 € each (250 pages)
      ◦ Over 10 million pages*: ca. 40,000 books = ca. 0.5 € per book
      ◦ ... No more excuses for not OCRing :o)
      ◦ * Page size is A4; bigger formats are counted as multiple pages
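
The pricing examples on slide 35 mix totals, per-page and per-book figures. A short sketch to convert between them, using only the numbers from the slide; the helper names are made up for this example.

```python
PAGES_PER_BOOK = 250  # book size used in the slide's examples


def cost_per_page_cents(total_euro, pages):
    """Per-page price in Euro cents for a package of the given size."""
    return total_euro / pages * 100


def cost_per_book_euro(cents_per_page, pages_per_book=PAGES_PER_BOOK):
    """Price of OCRing one book in Euro at a given per-page price."""
    return cents_per_page / 100 * pages_per_book


online_tier = cost_per_page_cents(299, 10_000)       # 10,000-page package
mid_tier = cost_per_page_cents(5_000, 500_000)       # 500,000-page package
books_in_mid_tier = 500_000 // PAGES_PER_BOOK
book_at_052_cents = cost_per_book_euro(0.52)         # "over 3 million pages" tier
# Per-page price implied by "ca. 0.5 € per book" in the "over 10 million pages" tier.
implied_top_tier_cents = 0.5 / PAGES_PER_BOOK * 100

print(f"{online_tier:.2f} ct/page, {mid_tier:.2f} ct/page, {books_in_mid_tier} books")
print(f"{book_at_052_cents:.2f} EUR/book, {implied_top_tier_cents:.2f} ct/page")
```

The smallest package works out to 2.99 cents per page and the 500,000-page package to 1 cent per page, or 2,000 books. At 0.52 cents per page a 250-page book costs 1.30 €, so the slide's 1.25 € corresponds to a rounded 0.5 cents per page. The 0.5 € per book quoted for the largest tier implies roughly 0.2 cents per page.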
  • 36. Pre-Announcement: ABBYY Online OCR Services with Gothic/Fraktur
      ◦ The ABBYY Gothic/Fraktur OCR Portal: finereader.abbyyonline.com
      ◦ Historic OCR added just last week
      ◦ Web GUI to upload documents and get results; simple to use
      ◦ Low volume, ad hoc usage; instant results, quality evaluation; pay as you go
      ◦ ABBYY Online OCR SDK: OCR service with API and XML output; runs on Windows Azure; currently in Closed Beta Test; Public Beta Test Q1/2012
  • 37. Summary
  • 38. The whole is greater than the sum of its parts (Aristotle)
  • 39. Thank you for your attention! Questions? Michael Fuchs, Senior Product Marketing Manager, ABBYY Europe, fuchs@abbyy.com
