IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.




IMPACT OCR in a nutshell
Clemens Neudecker, National Library of the Netherlands
IMPACT Demo Day, Biblioteca Nacional de España

OCR Process
        Binarisation
        = transform greyscale or colour images to bitonal (b/w)
        in order to separate foreground (text) from background

        Segmentation
        = detection of layout elements in hierarchical order
        (blocks/regions, lines, words, glyphs)

        Pattern Matching (Recognition)
        = matching of character shapes against an internal font
        database (classifiers)
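
The recognition step can be illustrated with a toy nearest-template classifier. This is only a sketch under stated assumptions: the 5x3 templates, the `hamming` distance and the `classify` helper are invented for illustration and are not how FineReader's classifiers work.

```python
# Toy "font database": each character is a 5x3 bitonal template (1 = ink).
# The templates are hypothetical and exist only for this illustration.
TEMPLATES = {
    "I": ["111", "010", "010", "010", "111"],
    "L": ["100", "100", "100", "100", "111"],
    "T": ["111", "010", "010", "010", "010"],
}


def hamming(a, b):
    """Count the pixels at which two equally sized bitmaps differ."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def classify(glyph, templates=TEMPLATES):
    """Return (label, distance) of the template closest to the glyph."""
    label = min(templates, key=lambda name: hamming(glyph, templates[name]))
    return label, hamming(glyph, templates[label])


# A noisy "T": one pixel of the top bar is missing.
noisy_t = ["110", "010", "010", "010", "010"]
print(classify(noisy_t))
```

A real engine works on segmented glyph images of varying size and uses far richer features than raw pixel differences; the sketch only shows the idea of matching a shape against stored shapes.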

ABBYY FineReader
        Main OCR technology provider in IMPACT
        Experts in OCR technology for 30 years
        IMPACT uses FineReader Engine (SDK)

Binarisation

Adaptive Binarisation

        [Figure: original scan compared with the previous binarisation and the new binarisation]
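
To make the difference between a fixed threshold and an adaptive one concrete, here is a minimal sketch. It assumes a simple local-mean rule (a pixel is ink if it is darker than the mean of its neighbourhood minus an offset); the function names, `radius` and `offset` are illustrative choices, not the method used by FineReader or developed in IMPACT.

```python
def binarise_global(img, threshold=128):
    """Mark a pixel as ink (1) if it is darker than one page-wide threshold."""
    return [[1 if p < threshold else 0 for p in row] for row in img]


def binarise_adaptive(img, radius=1, offset=30):
    """Mark a pixel as ink (1) if it is darker than its local mean minus an offset."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Collect the neighbourhood, clipped at the image border.
            window = [
                img[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            row.append(1 if img[y][x] < local_mean - offset else 0)
        out.append(row)
    return out


# Greyscale page with a bright left half (200) and a shaded right half (100).
# One text pixel sits on each half: 120 on the left, 20 on the right.
page = [
    [200, 200, 200, 100, 100, 100],
    [200, 120, 200, 100, 20, 100],
    [200, 200, 200, 100, 100, 100],
]

for row in binarise_global(page):
    print(row)    # the whole shaded half turns black
for row in binarise_adaptive(page):
    print(row)    # only the two text pixels are kept
```

With the fixed threshold the shaded half is swallowed entirely, while the local rule keeps both text pixels. The offset matters: too small a value produces false ink along the edge of the shading.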

IMPACT Binarisation

        [Figure: original scan, state-of-the-art binarisation, and IMPACT binarisation side by side]

Segmentation




        [Figure: segmentation at the level of blocks/regions, words, and glyphs]
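
The hierarchical idea (first find larger units, then split them further) can be sketched with projection profiles on a bitonal image. This is a classic textbook technique shown under simplifying assumptions (single column, no skew); `runs` and `segment` are hypothetical helper names, and nothing here reflects how FineReader Engine segments pages.

```python
def runs(profile):
    """Return (start, end) spans of consecutive non-zero entries; end is exclusive."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def segment(page):
    """Split a bitonal page (rows of 0/1) into text lines, then each line into glyph columns."""
    row_profile = [sum(row) for row in page]           # ink per row
    result = []
    for top, bottom in runs(row_profile):              # each run of inked rows is a line
        band = page[top:bottom]
        col_profile = [sum(col) for col in zip(*band)]  # ink per column within the line
        result.append(((top, bottom), runs(col_profile)))
    return result


page = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
]
print(segment(page))
```

Grouping glyph spans into words would add one more level, for example by treating gaps wider than some threshold as word boundaries.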

IMPACT Segmentation example
         [Figure: pre-IMPACT FR Engine 9 compared with FR Engine 10]
         Issue shown: part of a column was misclassified as an image

IMPACT Segmentation example
         [Figure: FR Engine v. 9 compared with v. 10]
         Issue shown: linear word order errors

IMPACT Segmentation example
         [Figure: FR Engine v. 9 compared with v. 10]
         Issue shown: lost text

Fraktur recognition

Languages and Dictionaries
 Goal:
  • Develop an interface so that external dictionaries can
    be integrated into the FineReader Engine


 2008 - 2009:
  • External Dictionary beta interface
  • Quality on par with internal dictionaries is possible


 2010 - 2011:
  • Make interface work reliably
  • Teach partners how to use it
  • Support for any language, any time period
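
Why an external dictionary helps with historical text can be shown with a small sketch. It assumes the recogniser offers several candidate readings per word and that a period-specific word list is available; `pick_word`, the candidate lists and the three-word lexicon are hypothetical and do not represent the FineReader Engine external dictionary interface, which these slides do not describe.

```python
import difflib


def pick_word(candidates, lexicon, cutoff=0.7):
    """Choose an output word from OCR candidates using an external word list."""
    # Prefer the first candidate that the lexicon knows.
    for candidate in candidates:
        if candidate in lexicon:
            return candidate
    # Otherwise fall back to the lexicon entry most similar to the top candidate.
    matches = difflib.get_close_matches(candidates[0], sorted(lexicon), n=1, cutoff=cutoff)
    return matches[0] if matches else candidates[0]


# Hypothetical lexicon of early modern German spellings.
historical = {"vnd", "seyn", "thun"}

print(pick_word(["und", "vnd"], historical))  # historical spelling wins
print(pick_word(["seyu"], historical))        # near miss is corrected
print(pick_word(["xyz"], historical))         # nothing similar: left unchanged
```

With a modern word list the first call would favour "und"; swapping in a lexicon for the right language and period is what changes the outcome.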


ALTO: New native export format




 Available since FRE 10 R2
 Supports most recent schema: ALTO v. 2.0
 Line coordinates available
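
As an illustration of what line coordinates in ALTO make possible, the sketch below reads text lines and their bounding boxes with Python's standard library. The XML fragment is hand-written and abridged for this example (it is not FineReader output and is not schema-validated), and `text_lines` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace of the ALTO v2 schema.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

# Hand-written, abridged fragment for illustration only.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="B1">
          <TextLine ID="L1" HPOS="100" VPOS="200" WIDTH="900" HEIGHT="40">
            <String CONTENT="IMPACT" HPOS="100" VPOS="200" WIDTH="300" HEIGHT="40"/>
            <String CONTENT="OCR" HPOS="420" VPOS="200" WIDTH="150" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def text_lines(alto_xml):
    """Return (text, (hpos, vpos, width, height)) for every TextLine element."""
    root = ET.fromstring(alto_xml)
    lines = []
    for line in root.iterfind(".//alto:TextLine", ALTO_NS):
        words = [s.get("CONTENT") for s in line.iterfind("alto:String", ALTO_NS)]
        box = tuple(float(line.get(key)) for key in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
        lines.append((" ".join(words), box))
    return lines


print(text_lines(SAMPLE))
```

The same coordinates can be used, for example, to highlight a search hit on the page image.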

                                       Thank you! Questions?
