The Effect of Scanning Parameters on OCR Results: A Case Study
Apostolos Antonacopoulos
PRImA Lab, The University of Salford, United Kingdom
www.primaresearch.org
Outline
  • Background
  • Image selection
  • Methods and procedures
  • Experiments
      • Experiment 1: Colour vs. greyscale vs. bitonal
      • Experiment 2: Effects of resolution
      • Experiment 3: Comparison with NLNZ images
  • Conclusions
Background
  • Cost of storage is a real issue for Content Holders
  • Study by Tracy Powell and Gordon Paynter of the National Library of New Zealand (DLIB 2009) opened a number of questions
  • Aims:
      • Examine the effects of colour in addition to greyscale and bitonal
      • Examine the effects of producing bitonal images in different ways
      • Examine the effects of different resolutions
      • Study the results by image rather than by average
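To make the storage concern concrete, the raw size of a scan can be worked out from page size, resolution and bit depth. The sketch below is illustrative arithmetic only: the page dimensions are a hypothetical value chosen for the example, not a figure from the study, and real files will be smaller once compressed.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw (uncompressed) size of a scanned image in bytes."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


# Hypothetical page size in inches (an assumption for illustration only).
PAGE_W_IN, PAGE_H_IN = 16, 22

# Bits per pixel for the three image types compared in the study.
MODES = {"colour": 24, "greyscale": 8, "bitonal": 1}

for dpi in (300, 400):
    for mode, bpp in MODES.items():
        size_mb = uncompressed_bytes(PAGE_W_IN, PAGE_H_IN, dpi, bpp) / 1e6
        print(f"{dpi} dpi {mode:9s} {size_mb:8.1f} MB")
```

Before compression, colour needs three times the space of greyscale and greyscale eight times that of bitonal, while doubling the resolution quadruples the size in every mode.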
Image Selection
  • Qualitative selection
  • Parts of newspaper articles (no layout issues)
  • Variety of newspapers from British Library collection
  • Quality of overall page taken into account
  • Regions of different quality selected from same page
  • Only text regions selected (no graphics present)
  • No additional artefacts (e.g. warping) present
Methods and Procedures
  • Regions marked using Aletheia and extracted from the main image as separate PAGE files
  • Text was keyed and represented in PAGE files
  • Selected (“standard”) colour reduction and binarisation methods were applied
  • ABBYY FineReader Engine 9 used for OCR
  • IMPACT OCR evaluation tool used
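The slides do not name the colour reduction and binarisation methods that were used. As a rough sketch of what such "standard" steps can look like, the example below converts RGB pixels to greyscale with the ITU-R BT.601 luma weights and then binarises with a global Otsu threshold. Both choices are assumptions made for illustration, not the study's actual pipeline, and the function names are hypothetical.

```python
def to_grey(rgb_pixels):
    """Convert (r, g, b) tuples to 0-255 grey levels (BT.601 luma weights)."""
    return [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in rgb_pixels]


def otsu_threshold(grey):
    """Return the threshold that maximises between-class variance (Otsu)."""
    hist = [0] * 256
    for v in grey:
        hist[v] += 1
    total = len(grey)
    sum_all = sum(i * h for i, h in enumerate(hist))

    w_bg = 0          # number of pixels at or below the candidate threshold
    sum_bg = 0        # sum of their grey levels
    best_t, best_var = 0, -1.0
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarise(grey, threshold):
    """Map grey levels to bitonal: 0 (ink) at or below threshold, else 1."""
    return [0 if v <= threshold else 1 for v in grey]


# Tiny synthetic example: five dark "ink" pixels and five light "paper" pixels.
pixels = [(20, 20, 20)] * 5 + [(230, 230, 230)] * 5
grey = to_grey(pixels)
t = otsu_threshold(grey)
print(t, binarise(grey, t))
```

A global threshold like this is only one way to produce a bitonal image. Scanner firmware and adaptive (local) methods can give quite different results on the same original, which is the kind of variation the bitonal comparison in the experiments looks at.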
Experiment 1: Colour/Grey/Bitonal
Accuracy Variation per Image
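The slides do not say how the IMPACT evaluation tool computes accuracy. One common measure, assumed here purely for illustration, is character accuracy derived from the edit (Levenshtein) distance between the keyed ground truth and the OCR output, reported per image region rather than as a single average. The function names and sample strings are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def char_accuracy(ground_truth, ocr_text):
    """Character accuracy in [0, 1], relative to the ground-truth length."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ground_truth, ocr_text)
    return max(0.0, 1 - errors / len(ground_truth))


# Made-up strings, not data from the study: report each region on its own.
regions = {
    "region_a": ("scanning parameters", "scanning parameters"),
    "region_b": ("scanning parameters", "scarming pararneters"),
}
for name, (truth, ocr) in regions.items():
    print(name, f"{char_accuracy(truth, ocr):.2%}")
```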
Bitonal: Best Algorithm Vs. Scanner
Original with Large Bitonal Variation BL9_r0
Experiment 2: Effects of Resolution
Experiment 3: Examine NLNZ Images
Variations in Quality and Accuracy
  • Other bitonal algorithm better: NLNZ1_r1
  • Scanner bitonal better: NLNZ4_r0
Conclusions
  • Averages do not give an accurate picture. Different decisions should be taken for different document types
  • Better quality images leave room for improvement (re-OCR), especially when accuracy is far from the high 90s (%)
  • Current OCR systems are not taking advantage of extra quality?
  • Higher quality (at least greyscale) is an investment
  • Perhaps not so high resolution for “routine” material
  • “Lossy” compression is a real option
      • Better to have a high quality image with an imperceptible “loss” than a perfect low quality image!
Further Information
  • PRImA: http://www.primaresearch.org
  • IMPACT: http://www.impact-project.eu

IMPACT Final Conference - Apostolos Antonacopoulos
