Bratislava WS - Fuchs - Abbyy - OCR overview_pdf

Published in: Technology, Business

  1. 1. Optical Character Recognition (OCR): Introduction & Overview
     Michael Fuchs, Senior Product Marketing Manager, ABBYY Europe, fuchs@abbyy.com
     IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
  2. 2. Agenda
     • ABBYY Technology in the IMPACT project
     • Who is ABBYY?
     • Company Overview
     • Product Overview
     • How is OCR used in real-life scenarios?
     • Optical Character Recognition - Basics
     • What is OCR?
     • How does OCR work inside?
     • OCR = Only Character Recognition?
     • IMPACT – the areas of improvement
     • Questions & Answers
     IMPACT + ABBYY - OCR Introduction & Overview
  3. 3. IMPACT & ABBYY
  4. 4. Improving Access to Text
     • Mission of IMPACT: to significantly improve access to historical text and remove the barriers that stand in the way of the mass digitisation of the European cultural heritage.
     • Partners: Koninklijke Bibliotheek; The British Library; Österreichische Nationalbibliothek; Universität Innsbruck; Deutsche Nationalbibliothek; Bayerische Staatsbibliothek; Staats- und Universitätsbibliothek Göttingen; ABBYY; IBM Israel – Science and Technology Ltd; Instituut voor Nederlandse Lexicologie; National Centre for Scientific Research "Demokritos"; Centrum für Informations- und Sprachverarbeitung, University of Munich; University of Bath; University of Salford; Bibliothèque Nationale de France
     • Web: www.impact-project.eu
  5. 5. IMPACT & ABBYY
     • ABBYY is the OCR technology provider for IMPACT members
     • IMPACT members work with ABBYY's OCR SDK (FineReader Engine), because:
       • Only development toolkits allow developers to combine new or different modules, for example complex dictionaries
       • Scientific research and tests have to be implemented in custom modules
  6. 6. IMPACT & ABBYY
     • ABBYY improves the OCR core technologies for the recognition of old documents. Current focus areas are:
       • Image pre-processing
       • Character recognition
     • IMPACT currently focuses on research, not on setting up a production system ;o)
     • Improvements in ABBYY recognition technologies that result from the IMPACT project will be added to future products
     • Important: ABBYY FineReader 8/9/10 Professional (Box) has NO Fraktur OCR
     • Fraktur OCR is only available in Recognition Server and FineReader Engine
  7. 7. ABBYY – an Overview
  8. 8. Who is ABBYY?
     • Leading developer of artificial intelligence software in document recognition, data capture and linguistics
     • Headquartered in Moscow, Russia
     • Founded in 1989 by Mr. David Yang as BIT Software
     • More than 880 employees worldwide
     • 8 offices worldwide
     • Established sales and distribution network in more than 130 countries worldwide
     ABBYY & OCR for IMPACT
  9. 9. ABBYY Worldwide (office map)
     • ABBYY Headquarters / ABBYY Russia: Moscow
     • ABBYY USA: Fremont
     • ABBYY Europe GmbH: Munich, Germany
     • ABBYY Ukraine: Kiev
     • ABBYY Europe UK
     • ABBYY Japan
     • ABBYY Taiwan
  10. 10. ABBYY in Western Europe: ABBYY Europe GmbH
     • Located in Munich, Germany
     • Established in 2001
     • Serves partners and customers in Western European countries
     • Sales and Marketing
       • Sales: distribution, channel development, partner management
       • Marketing: product marketing, channel marketing, outbound marketing (PR, advertising, direct)
     • More than 50 employees today
  11. 11. Product Overview
  12. 12. ABBYY Product Brands: Mainline Distribution
     "Box" products:
     • ABBYY FineReader: optical character recognition (OCR) and text processing end-user products
     • ABBYY FotoReader: conversion of texts taken with digital cameras
     • ABBYY PDF Transformer: PDF conversion and creation for end users
     • ABBYY Lingvo: electronic dictionaries, Russian and European languages
  13. 13. ABBYY Product Brands: Direct Sales and VAR Distribution
     Licensing and integration products:
     • ABBYY Recognition Server: server-based OCR
     • ABBYY FormReader and ABBYY FlexiCapture: form processing, unstructured document processing, document assembly
     • ABBYY FineReader Engine SDK: comprehensive toolkit for integrating recognition and data capture technologies into third-party applications
     • ABBYY Mobile OCR Engine: OCR for thin clients such as mobile phones, PDAs and Web applications
  14. 14. ABBYY OCR Products – Usage View (OCR & Document Conversion)
     • Desktop/Workgroup: user-driven processing, ready to use
       • Products: FineReader (Professional, Corporate, Site Licence Edition), PDF Transformer, FotoReader, ScreenshotReader
       • Users are: end users, companies, scan service providers (libraries)
     • Server/Backend: automated processing, ready to use
       • Products: Recognition Server (Professional, Extended Edition)
       • Users are: companies, scan service providers, libraries
     • SDK/Integration: automated processing, development needed
       • Products: FineReader Engines (Windows, Linux, Mac OS X, FreeBSD, Embedded Systems), Mobile OCR Engine (Android, Symbian, Linux, Windows, Windows Mobile, iPhone)
       • Users are: developers, IMPACT research
  15. 15. OCR Basics
  16. 16. Designed not to be OCRed
  17. 17. What (ABBYY) OCR can read...
     • Recognition Languages
       • More than 191 languages altogether
       • Alphabets: Cyrillic, Latin, Greek, Armenian, Hebrew, Thai
       • 34 languages with dictionary support and spell check
       • Chinese, Japanese, Korean (CJK): 4 character sets (Traditional Chinese, Simplified Chinese, Japanese, Korean)
       • 5 languages in FineReader XIX (Gothic and other fonts of the 17th to 20th centuries)
       • 6 programming languages (Basic, C/C++, COBOL, Java, etc.)
       • 4 artificial languages (Esperanto, Interlingua, etc.)
       • Simple chemical formulas
     • Font Types
       • Recognition of mixed font types (dot-matrix printer, typewriter, Gothic, etc.)
       • OCR-A
       • OCR-B
       • MICR (E13B)
       • CMC-7
  18. 18. OCR Processing Steps
     • Step 1. Scanning, Image Loading, Pre-Processing and Modification
       • Compensating for image defects and making the document easier to view and better suited for automatic OCR
     • Step 2. Document Layout Analysis
       • Detecting sections of a document, analysing the layout and finding barcodes
     • Step 3. Character Recognition
       • Automatic recognition of characters, applying the selected recognition languages, dictionaries and other settings
     • Step 4. Verification by Operators (optional)
       • Manual validation of suspicious characters and words
     • Step 5. Document Synthesis and Export
       • Generating an output document in the selected format
  19. 19. OCR Processing Steps
     • Step 1. Image Loading, Pre-Processing and Modification
       Images from existing files or captured with a scanner
       • Splitting images
       • Scaling (e.g. low-resolution images can be digitally magnified)
       • Rotation (by 90, 180 or 270 degrees)
       • Flipping and inverting images
       • Cropping (selecting rectangular areas)
       • Creating previews (small preview images)
       • Changing text colour and background in rectangular areas
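The image modifications named on slide 19 are simple geometric and pixel operations. The sketch below illustrates rotation, flipping, inverting and cropping on a tiny black-and-white image stored as a list of rows. The function names and the image representation are assumptions made for illustration only and are not part of any ABBYY API.

```python
# Illustrative only: an image is a list of rows, 0 = white pixel, 1 = black pixel.

def rotate_90_clockwise(img):
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]

def invert(img):
    """Swap black and white pixels."""
    return [[1 - pixel for pixel in row] for row in img]

def crop(img, top, left, height, width):
    """Select a rectangular area of the image."""
    return [row[left:left + width] for row in img[top:top + height]]

page = [
    [1, 0, 0],
    [1, 1, 0],
]

rotated = rotate_90_clockwise(page)   # 2x3 image becomes 3x2
mirrored = flip_horizontal(page)
negative = invert(page)
corner = crop(page, top=0, left=0, height=1, width=2)
```

Rotation by 180 or 270 degrees follows from applying `rotate_90_clockwise` two or three times.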
  20. 20. ABBYY OCR Processing Steps
     • Step 1. Image Loading, Pre-Processing and Modification
       Compensating for scanning defects
       • Automatic de-skew to a straight position
       • Straightening text lines
       • Controlled de-speckle (removing stray dots)
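De-speckling, as named on slide 20, can be pictured as removing black pixels that have too few black neighbours. The sketch below is a generic neighbour-count filter, not ABBYY's algorithm; the `min_neighbours` parameter is a hypothetical stand-in for the "controlled" part, letting the caller decide how aggressively dots are removed.

```python
# Illustrative only: an image is a list of rows, 0 = white pixel, 1 = black pixel.

def despeckle(img, min_neighbours=1):
    """Clear black pixels that have fewer than min_neighbours black
    pixels among their (up to) eight surrounding pixels."""
    height, width = len(img), len(img[0])
    cleaned = [row[:] for row in img]
    for y in range(height):
        for x in range(width):
            if img[y][x] != 1:
                continue
            neighbours = sum(
                img[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbours < min_neighbours:
                cleaned[y][x] = 0
    return cleaned

noisy = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],   # isolated dot: removed
    [0, 0, 0, 1, 1],   # two connected pixels: kept
    [0, 0, 0, 0, 0],
]
clean = despeckle(noisy)
```

A higher `min_neighbours` removes more noise but risks erasing thin strokes and punctuation, which is why the amount of cleaning needs to be controllable.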
  21. 21. OCR Processing Steps
     • Step 1. Image Loading, Pre-Processing and Modification
       • Intelligent background filtering
       • Adaptive binarisation: a single, image-wide binarisation threshold cannot deliver good results for OCR
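The difference between image-wide and adaptive binarisation can be shown on a page whose background gets darker from left to right, as happens with shadows or uneven scanning. The sketch below compares a fixed threshold with a local-mean threshold. It is a textbook-style illustration, not ABBYY's method; the `radius` and `offset` parameters and the sample values are assumptions.

```python
# Illustrative only: a grayscale image is a list of rows with values 0 (black) to 255 (white).

def global_binarise(gray, threshold=128):
    """Mark a pixel black (1) if it is darker than one fixed threshold."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]

def adaptive_binarise(gray, radius=1, offset=20):
    """Mark a pixel black (1) if it is clearly darker than the mean
    of its local neighbourhood (window of size 2 * radius + 1)."""
    height, width = len(gray), len(gray[0])
    result = []
    for y in range(height):
        out_row = []
        for x in range(width):
            window = [
                gray[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            out_row.append(1 if gray[y][x] < local_mean - offset else 0)
        result.append(out_row)
    return result

# Background fades from 220 to 60; "ink" pixels sit at index 2 and index 6.
scan = [[220, 200, 100, 160, 140, 120, 20, 80, 60]]

fixed = global_binarise(scan)       # dark background on the right turns black
adaptive = adaptive_binarise(scan)  # only the two ink pixels turn black
```

With the fixed threshold, the shaded background on the right is classified as text; the local threshold follows the background and keeps only the ink.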
  22. 22. OCR Processing Steps
     • Step 1. Image Loading, Pre-Processing and Modification
       • Success during IMPACT (figure panels: Original, State of the Art, New)
       • New result: no text from the other page
  23. 23. New Binarization Examples (figure panels: original scan, previous binarization, new binarization)
  24. 24. Camera OCR: automatic correction of 3D perspective distortions (figure panels: before, after)
  25. 25. Camera OCR: ISO noise reduction (figure panels: before, after)
  26. 26. OCR Processing Steps
     • Step 2. Document Layout Analysis
       Detecting sections of a document, analysing the layout and finding barcodes
  27. 27. OCR Processing Steps
     • Step 3. Character Recognition
       After line detection, character recognition is applied with different classifiers:
       • Raster classifier
       • Contour classifier
       • Structure classifier
       • Feature differentiating classifier
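The slide names the classifiers without describing how they work. As a rough intuition for the raster idea only, the sketch below compares a character bitmap pixel by pixel with stored templates and picks the best match. This is generic template matching under assumed 3x3 templates, not ABBYY's raster classifier; all names and patterns are hypothetical.

```python
# Illustrative only: tiny 3x3 character templates, "1" = black pixel.
TEMPLATES = {
    "I": ["010", "010", "010"],
    "L": ["100", "100", "111"],
    "T": ["111", "010", "010"],
}

def raster_score(glyph, template):
    """Fraction of pixels on which glyph and template agree."""
    matches = sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )
    total = sum(len(row) for row in template)
    return matches / total

def classify(glyph, templates=TEMPLATES):
    """Return the best-matching label and its score."""
    scores = {label: raster_score(glyph, tpl) for label, tpl in templates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# An "L" with one extra noise pixel in the top row.
noisy_l = ["110", "100", "111"]
label, score = classify(noisy_l)
```

A score below 1.0 signals an imperfect match, which is one reason several classifiers with different views of the character are combined.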
  28. 28. OCR Optimization
     • Step 3. Character Recognition: learning new symbols
       Custom pattern training to learn special characters at the pixel level
  29. 29. OCR Optimization
     • Step 3. Character Recognition: back to the word level
       Applying the selected recognition languages and dictionaries
       • Custom languages and dictionaries can be defined
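Going "back to the word level" means checking recognised character strings against a dictionary. The sketch below uses Python's standard `difflib.get_close_matches` to replace an unknown word with its closest dictionary entry. The word list, the `cutoff` value and the function name are assumptions for illustration; the slide does not say how ABBYY applies its dictionaries.

```python
import difflib

def correct_word(word, dictionary, cutoff=0.8):
    """Return the word unchanged if it is in the dictionary, otherwise the
    closest dictionary entry above the similarity cutoff, otherwise the
    original word."""
    if word in dictionary:
        return word
    matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

# Hypothetical custom dictionary.
dictionary = ["bibliothek", "buch", "zeitung"]

fixed = correct_word("bibliothck", dictionary)   # one misread character
known = correct_word("zeitung", dictionary)      # already valid
unknown = correct_word("xyz", dictionary)        # nothing similar enough
```

For historical material the quality of this step depends on the dictionary containing the old spellings, which is why custom dictionaries matter.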
  30. 30. OCR Processing Steps
     • Step 4. Verification by Operators (optional)
       Manual validation or correction of:
       • Layout analysis results
         • Text blocks
         • Image blocks
         • Table blocks
       • Suspicious characters and word corrections using dictionaries
       • Re-recognition with other language settings
       • Recognition Server allows a quality level to be set and processing results to be logged in an XML file
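Verification usually starts from the characters the engine itself was unsure about. The sketch below assumes, purely for illustration, that each recognised character comes with a confidence value between 0 and 1, and collects the words an operator should look at. The data layout and the `min_confidence` threshold are hypothetical and do not reflect Recognition Server's actual quality setting.

```python
def words_needing_review(words, min_confidence=0.85):
    """Return (text, positions) for every word containing at least one
    character whose confidence is below min_confidence."""
    flagged = []
    for word in words:
        suspicious = [
            index for index, (_, confidence) in enumerate(word)
            if confidence < min_confidence
        ]
        if suspicious:
            text = "".join(char for char, _ in word)
            flagged.append((text, suspicious))
    return flagged

# Each word is a list of (character, confidence) pairs.
recognised = [
    [("O", 0.99), ("C", 0.98), ("R", 0.97)],
    [("F", 0.95), ("r", 0.90), ("a", 0.60), ("k", 0.92)],
]
to_review = words_needing_review(recognised)
```

Raising the threshold sends more words to the operator and lowers the error rate at the cost of more manual work.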
  31. 31. ABBYY OCR Processing Steps
     • Step 5. Document Synthesis and Export
       Generating an output document in the selected format
       • TXT, Office formats, PDF, etc.
       • From version 9.0 on, ADRT (Adaptive Document Recognition Technology) is included. Goal: understanding the document structure and detecting, for example, headers, footers and footnotes. Version 10 adds tables of contents.
       • SDKs and Recognition Server offer more export formats, e.g.
         • XML
         • Internal FineReader Engine format
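To make the XML export option concrete, the sketch below writes recognised text lines into a small XML document with Python's standard `xml.etree.ElementTree`. The element names (`document`, `page`, `line`) are invented for this example and are neither ABBYY's XML schema nor ALTO.

```python
import xml.etree.ElementTree as ET

def export_xml(pages):
    """Serialise a list of pages (each a list of text lines) to an XML string."""
    root = ET.Element("document")
    for number, lines in enumerate(pages, start=1):
        page_element = ET.SubElement(root, "page", number=str(number))
        for text in lines:
            ET.SubElement(page_element, "line").text = text
    return ET.tostring(root, encoding="unicode")

xml_output = export_xml([["Hello", "World"], ["Text & more"]])

# Reading the result back shows that special characters survive the round trip.
parsed = ET.fromstring(xml_output)
second_page_text = parsed.find("page[@number='2']/line").text
```

A structured format like this keeps page and line boundaries that a plain TXT export would lose.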
  32. 32. OCR in General & IMPACT in Particular
  33. 33. OCR = Only Character Recognition?
     • Recreates the same layout as in the original document
       • Resulting document looks just like the scanned original
       • Information captured during layout analysis is used here
     • Supports popular document formats
       • ABBYY products support the popular output formats customers need: PDF, PDF/A, XML, HTML, TXT/CSV, Word, Excel, PowerPoint and DBF
     • Supports image output
       • BMP, PCX, JPEG, JPEG 2000, TIFF, PNG
     • Compliance with regulations
       • Support for selective access password protection, document encryption, PDF/A format, etc.
  34. 34. IMPACT = "Step by Step" Optimisation
     • Step 1. Image Quality
       • Problem areas: scans of microfilms, distortions, show-through characters
       • Optimisation approach: image pre-processing, e.g. binarisation
     • Step 2. Document Analysis
       • Problem areas: layout of old print material, e.g. narrow columns in old newspapers
       • Optimisation approach: improved layout and document analysis
     • Step 3. Character Recognition & Languages
       • Problem areas: fonts used, old language (grammar and spelling)
       • Optimisation approach: optimised patterns, adaptive OCR, creation of special dictionaries
     • Step 4. Validation & Correction
       • Problem areas: frequently recurring errors during Fraktur OCR, scalability of correction
       • Optimisation approach: new approaches for mass verification
     • Step 5. Document Synthesis, Export & Rating
       • Problem areas: content classification, metadata generation, "reliable" formats
       • Optimisation approach: XML, ALTO XML, XML analysis, PDF/A, …
  35. 35. Thank you for your attention! Questions?
     Michael Fuchs, Senior Product Marketing Manager, ABBYY Europe, fuchs@abbyy.com
