IMPACT Final Conference - Apostolos Antonacopoulos

Case Study: Scanning Parameters

  1. The Effect of Scanning Parameters on OCR Results: A Case Study
     Apostolos Antonacopoulos, PRImA Lab, The University of Salford, United Kingdom
     www.primaresearch.org
  2. Outline
     - Background
     - Image selection
     - Methods and procedures
     - Experiments
       - Experiment 1: Colour vs. greyscale vs. bitonal
       - Experiment 2: Effects of resolution
       - Experiment 3: Comparison with NLNZ images
     - Conclusions
  3. Background
     - Cost of storage is a real issue for Content Holders
     - Study by Tracy Powell and Gordon Paynter of the National Library of New Zealand (DLIB 2009) opened a number of questions
     - Aims:
       - Examine the effects of colour in addition to greyscale and bitonal
       - Examine the effects of producing bitonal images in different ways
       - Examine the effects of different resolutions
       - Study the results by image rather than average
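To make the storage concern concrete, the arithmetic behind it can be sketched as follows. This is only an illustration: the page dimensions and resolutions in the usage note are assumptions, not figures from the slides, and the calculation covers uncompressed rasters only (row padding and compression are ignored).

```python
# Bits per pixel for the three colour modes compared in the study.
BITS_PER_PIXEL = {"colour": 24, "greyscale": 8, "bitonal": 1}


def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Approximate uncompressed raster size in bytes for a scanned page.

    width_in and height_in are the page dimensions in inches
    (hypothetical parameters, not taken from the slides).
    """
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


def storage_table(width_in, height_in, resolutions):
    """Return {(mode, dpi): bytes} for every mode/resolution combination."""
    return {
        (mode, dpi): uncompressed_bytes(width_in, height_in, dpi, bpp)
        for mode, bpp in BITS_PER_PIXEL.items()
        for dpi in resolutions
    }
```

For example, `storage_table(10, 15, [300, 400])` compares an assumed 10 x 15 inch page at two assumed resolutions. Size grows with the square of the resolution and linearly with bit depth, which is why both parameters matter for storage cost.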
  4. Image Selection
     - Qualitative selection
     - Parts of newspaper articles (no layout issues)
     - Variety of newspapers from British Library collection
     - Quality of overall page taken into account
     - Regions of different quality selected from same page
     - Only text regions selected (no graphics present)
     - No additional artefacts (e.g. warping) present
  5. Methods and Procedures
     - Regions marked using Aletheia and extracted from the main image as separate PAGE files
     - Text was keyed and represented in PAGE files
     - Selected (“standard”) colour reduction and binarisation methods were applied
     - ABBYY FineReader Engine 9 used for OCR
     - IMPACT OCR evaluation tool used
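The slides do not say which accuracy measure the IMPACT OCR evaluation tool reports. A common choice when comparing OCR output against keyed ground truth is character accuracy based on edit distance, and the sketch below assumes that measure purely for illustration. The function names are hypothetical and this is not the tool's implementation.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth, ocr_text):
    """1 - (edit errors / ground-truth length), clamped at 0.

    Assumed measure for illustration only.
    """
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ground_truth, ocr_text)
    return max(0.0, 1.0 - errors / len(ground_truth))
```

Computing this once per extracted region, rather than over the pooled text, matches the stated aim of studying results by image rather than by average.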
  6. Experiment 1: Colour/Grey/Bitonal
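The slides refer to “standard” colour reduction and binarisation methods without naming them. As a rough sketch of what such a pipeline can look like, the code below uses two stand-ins chosen here as assumptions: the ITU-R BT.601 luma weights for colour to greyscale, and Otsu's global threshold for greyscale to bitonal. Neither is confirmed as the method used in the study.

```python
def to_grey(rgb):
    """Convert an (r, g, b) pixel (0-255 each) to a grey level, BT.601 weights."""
    r, g, b = rgb
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def otsu_threshold(grey_pixels):
    """Return the grey level t that maximises between-class variance (Otsu).

    Pixels <= t form one class, pixels > t the other.
    """
    hist = [0] * 256
    for v in grey_pixels:
        hist[v] += 1
    total = len(grey_pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarise(grey_pixels):
    """Map each grey pixel to 0 (dark, at or below threshold) or 1 (light)."""
    t = otsu_threshold(grey_pixels)
    return [1 if v > t else 0 for v in grey_pixels]
```

A global threshold like this is only one way to produce a bitonal image. Scanner firmware and adaptive (local) methods can give quite different results on the same page.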
  7. Accuracy Variation per Image
  8. Bitonal: Best Algorithm vs. Scanner
  9. Original with Large Bitonal Variation: BL9_r0
  10. Experiment 2: Effects of Resolution
  11. Experiment 3: Examine NLNZ Images
  12. Variations in Quality and Accuracy
     - Other bitonal algorithm better: NLNZ1_r1
     - Scanner bitonal better: NLNZ4_r0
  13. Conclusions
     - Averages do not give an accurate picture. Different decisions should be taken for different document types
     - Better quality images leave room for improvement (re-OCR), especially when accuracy is far from the high 90s (%)
     - Are current OCR systems failing to take advantage of the extra quality?
     - Higher quality (at least greyscale) is an investment
       - Perhaps not such a high resolution for “routine” material
     - “Lossy” compression is a real option
     - Better to have a high quality image with an imperceptible “loss” than a perfect low quality image!
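The first conclusion, that averages hide per-image behaviour, can be illustrated with a small sketch. The function below is hypothetical, and the numbers in the usage note are made-up placeholders rather than results from the study.

```python
from statistics import mean


def summarise(results):
    """Summarise OCR accuracy per mode and per image.

    results: {image_id: {mode: accuracy}} (hypothetical structure).
    Returns (average accuracy per mode,
             best mode per image,
             best-minus-worst accuracy spread per image).
    """
    modes = sorted({m for per_image in results.values() for m in per_image})
    averages = {
        m: mean(r[m] for r in results.values() if m in r) for m in modes
    }
    best_per_image = {img: max(r, key=r.get) for img, r in results.items()}
    spread = {
        img: max(r.values()) - min(r.values()) for img, r in results.items()
    }
    return averages, best_per_image, spread
```

With placeholder input such as `{"img_a": {"greyscale": 0.95, "bitonal": 0.97}, "img_b": {"greyscale": 0.90, "bitonal": 0.60}}`, greyscale has the higher average, yet bitonal is the better choice for `img_a`. The per-image view and the spread expose that difference, while the average alone does not.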
  14. Further Information
     - PRImA
       - http://www.primaresearch.org
     - IMPACT
       - http://www.impact-project.eu
