IMPACT OCR in a nutshell. Clemens Neudecker
1. IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
IMPACT OCR in a nutshell
Clemens Neudecker, National Library of the Netherlands
IMPACT Demo Day, Biblioteca Nacional de España
OCR Process
Binarisation
= transform greyscale or colour images to bitonal (b/w) in order to separate foreground (text) from background
Segmentation
= detection of layout elements in hierarchical order (blocks/regions, lines, words, glyphs)
Pattern Matching (Recognition)
= matching of character shapes against an internal font database (classifiers)
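As a rough illustration of the binarisation step, the sketch below picks one global threshold with Otsu's method and maps every pixel to foreground (1) or background (0). Otsu's method is a stand-in chosen for brevity; the slides do not say which method FineReader or IMPACT use, and the names `otsu_threshold` and `binarise` are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximises between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    n_dark = 0      # number of pixels with value <= t
    sum_dark = 0    # sum of those pixel values
    for t in range(256):
        n_dark += hist[t]
        sum_dark += t * hist[t]
        if n_dark == 0:
            continue
        n_light = total - n_dark
        if n_light == 0:
            break
        mean_dark = sum_dark / n_dark
        mean_light = (sum_all - sum_dark) / n_light
        # Between-class variance, up to a constant factor.
        var = n_dark * n_light * (mean_dark - mean_light) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarise(image):
    """image: list of rows of grey values. Returns 1 for dark (text) pixels."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p <= t else 0 for p in row] for row in image]


if __name__ == "__main__":
    scan = [[200, 210, 30],
            [220, 25, 215]]
    print(binarise(scan))
```

A single global threshold like this tends to fail on unevenly lit or stained pages, which is the case adaptive methods address.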
ABBYY FineReader
Main OCR technology provider in IMPACT
OCR technology experts for 30 years
IMPACT uses FineReader Engine (SDK)
Binarisation
Adaptive Binarisation
Original scan | Previous binarisation | New binarisation
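To make the idea of adaptive binarisation concrete, here is a minimal sketch that compares each pixel with the mean of its local neighbourhood instead of one page-wide threshold. This is plain local-mean thresholding, used only as an illustration; it is not the FineReader or IMPACT algorithm, and `adaptive_binarise`, `radius` and `offset` are assumed names.

```python
def adaptive_binarise(image, radius=1, offset=10):
    """Mark a pixel as text (1) if it is darker than its local mean by `offset`.

    image: list of rows of grey values (0-255).
    radius: half-size of the square neighbourhood window.
    """
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighbourhood, clipped at the image borders.
            window = [
                image[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            row.append(1 if image[y][x] < local_mean - offset else 0)
        result.append(row)
    return result


if __name__ == "__main__":
    # Background fades from 200 to 110; text pixels sit at 110 and 60.
    scan = [[200, 190, 110, 170, 160, 150, 140, 60, 120, 110]]
    print(adaptive_binarise(scan))
```

In the sample row the first text pixel (110) is as bright as the background at the far right, so no single global threshold separates them, while the local comparison does.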
IMPACT Binarisation
Original | State of the art | IMPACT
Segmentation
Blocks/Regions | Words | Glyphs
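One way to picture the segmentation hierarchy (blocks/regions, lines, words, glyphs) is as nested records in which each level's bounding box is the union of its children's boxes. The sketch below is illustrative only; the class names, the `(x, y, width, height)` box layout and `union_box` are assumptions, not part of any engine's API.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height), assumed layout


def union_box(boxes: List[Box]) -> Box:
    """Smallest box that encloses all given boxes."""
    left = min(b[0] for b in boxes)
    top = min(b[1] for b in boxes)
    right = max(b[0] + b[2] for b in boxes)
    bottom = max(b[1] + b[3] for b in boxes)
    return (left, top, right - left, bottom - top)


@dataclass
class Glyph:
    char: str
    box: Box


@dataclass
class Word:
    glyphs: List[Glyph]

    @property
    def text(self) -> str:
        return "".join(g.char for g in self.glyphs)

    @property
    def box(self) -> Box:
        return union_box([g.box for g in self.glyphs])


@dataclass
class Line:
    words: List[Word]

    @property
    def text(self) -> str:
        return " ".join(w.text for w in self.words)

    @property
    def box(self) -> Box:
        return union_box([w.box for w in self.words])


@dataclass
class Block:
    lines: List[Line]

    @property
    def box(self) -> Box:
        return union_box([ln.box for ln in self.lines])


if __name__ == "__main__":
    word = Word([Glyph("O", (0, 0, 5, 10)), Glyph("K", (6, 0, 5, 10))])
    block = Block([Line([word])])
    print(word.text, block.box)
```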
IMPACT Segmentation example
Pre-IMPACT FR Engine 9 | FR Engine 10
Part of column was misclassified as image
IMPACT Segmentation example
v. 9 | v. 10
Linear word order errors
IMPACT Segmentation example
v. 9 | v. 10
Lost text
Fraktur recognition
Languages and Dictionaries
Goal:
• Develop an interface so that external dictionaries can be integrated into the FineReader Engine
2008 - 2009:
• External Dictionary beta interface
• Same quality as with internal dictionaries possible
2010 - 2011:
• Make interface work reliably
• Teach partners how to use it
• Support for any language, any time period
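To show why an external dictionary matters for historical text, the sketch below prefers recognition candidates that appear in a supplied word list and falls back to raw confidence otherwise. This is a generic illustration, not the FineReader Engine external dictionary interface; `pick_candidate`, the candidate tuples and the sample lexicon are all hypothetical.

```python
def pick_candidate(candidates, lexicon):
    """Choose one reading for a word.

    candidates: list of (text, confidence) pairs from a recogniser.
    lexicon: set of lower-cased known word forms, e.g. a historical lexicon.
    Candidates found in the lexicon win; ties are broken by confidence.
    """
    in_lexicon = [c for c in candidates if c[0].lower() in lexicon]
    pool = in_lexicon or candidates  # fall back if nothing is known
    return max(pool, key=lambda c: c[1])[0]


if __name__ == "__main__":
    # Hypothetical historical Dutch spellings.
    lexicon = {"vrede", "oorlogh"}
    print(pick_candidate([("vrcde", 0.6), ("vrede", 0.5)], lexicon))
```

With a modern-only dictionary a valid historical spelling would be treated as unknown, which is why support for any language and any time period depends on pluggable word lists.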
ALTO: New native export format
Available since FRE 10 R2
Supports most recent schema: ALTO v. 2.0
Line coordinates available
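As an example of what the line coordinates make possible, the following sketch reads an ALTO v. 2.0 document with Python's standard library and lists each text line with its position. The embedded sample is a trimmed, hand-written fragment (not schema-valid and not real FineReader output), and `text_lines` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace of the ALTO v2 schema.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

# Hand-written, trimmed sample for illustration only.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine HPOS="100" VPOS="200" WIDTH="800" HEIGHT="40">
            <String CONTENT="IMPACT" HPOS="100" VPOS="200" WIDTH="300" HEIGHT="40"/>
            <SP/>
            <String CONTENT="OCR" HPOS="420" VPOS="200" WIDTH="150" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def text_lines(alto_xml):
    """Return [(line_text, (hpos, vpos, width, height)), ...] for every TextLine."""
    root = ET.fromstring(alto_xml)
    lines = []
    for line in root.iterfind(".//alto:TextLine", ALTO_NS):
        words = [s.get("CONTENT", "") for s in line.iterfind("alto:String", ALTO_NS)]
        box = tuple(float(line.get(k, 0)) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
        lines.append((" ".join(words), box))
    return lines


if __name__ == "__main__":
    for text, box in text_lines(SAMPLE):
        print(text, box)
```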
Thank you! Questions?