Datech2014-Session1-Document Representation Refinement for Precise Region Description


Slides of the presentation of the paper Document Representation Refinement for Precise Region Description by Christian Clausner, Stefan Pletschacher and Apostolos Antonacopoulos. #digidays

  1. Document Representation Refinement for Precise Region Description
     Christian Clausner, Stefan Pletschacher and Apostolos Antonacopoulos
     PRImA Lab, School of Computing, Science and Engineering, University of Salford, United Kingdom
  2. Document Page Regions (DATeCH 2014)
     Segmentation, Classification
     • Region (block, zone): connected area of a document image with content of a single specific type
     • Examples: text, graphic, table
  3. Region Representation
     • By geometric objects
       – Bounding box
       – Stack of rectangles
       – Polygon
     • By pixels
       – Bitmap
       – Run-length encoding
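
To make the difference between these representations concrete, here is a minimal Python sketch that encodes a small bitmap region with run-length encoding and compares its pixel count with the area of its bounding box. The function names and the `(start, length)` run format are illustrative assumptions, not taken from the slides.

```python
from typing import List, Tuple

Run = Tuple[int, int]  # (start column, run length) of one foreground run


def rle_encode_row(row: List[int]) -> List[Run]:
    """Encode one bitmap row as (start, length) runs of foreground pixels."""
    runs: List[Run] = []
    start = None
    for x, pixel in enumerate(row):
        if pixel and start is None:
            start = x                          # a run begins
        elif not pixel and start is not None:
            runs.append((start, x - start))    # a run ends
            start = None
    if start is not None:                      # run reaches the right edge
        runs.append((start, len(row) - start))
    return runs


def rle_decode_row(runs: List[Run], width: int) -> List[int]:
    """Rebuild a bitmap row of the given width from its runs."""
    row = [0] * width
    for start, length in runs:
        for x in range(start, start + length):
            row[x] = 1
    return row


def bounding_box(bitmap: List[List[int]]) -> Tuple[int, int, int, int]:
    """Inclusive (left, top, right, bottom); assumes at least one foreground pixel."""
    xs = [x for row in bitmap for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(bitmap) if any(row)]
    return min(xs), min(ys), max(xs), max(ys)


# An L-shaped region: 6 foreground pixels inside a 4 x 3 bounding box.
region = [
    [1, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
encoded = [rle_encode_row(row) for row in region]
left, top, right, bottom = bounding_box(region)
box_area = (right - left + 1) * (bottom - top + 1)
pixel_area = sum(length for runs in encoded for _, length in runs)
```

In this toy case the bounding box claims twice the area the region actually covers, which is the kind of imprecision the following slides address.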
  4. Need for Precise Region Descriptions
     • Precise description is crucial for all but the most trivial document analysis and recognition applications
     • For performance evaluation: the loss of quality introduced by imprecise regions can be larger than the variation in accuracy of the actual recognition method
  5. The Situation
     • Trend towards more precise descriptions, but…
     • Output of state-of-the-art OCR systems:
       – Stacks of rectangles (ABBYY FineReader Engine 11)
       – Bounding boxes (Tesseract OCR 3.02)
     • Popular formats for layout analysis and OCR results:
       – ALTO XML (boxes, ellipses, polygons (region level only))
       – FineReader XML (stacks of rectangles (region level only))
       – PAGE XML (polygons for all levels)
       – HOCR (boxes)
  6. Refinement through Polygonal Fitting
     • Applicable to regions that have child objects in the document model
     • A typical object hierarchy contains regions, text lines, words and glyphs (characters)
     • Idea: tightly wrap a polygon around the child objects
  7. Polygonal Fitting Approach
     1. Create bitmasks for the child objects and transfer them to an empty bitmap
     2. Fill the gaps between the child objects using a smearing approach
     3. Optional: exclude neighbour regions
     4. Trace the contour of the foreground and create a polygon
  8. 1 – Transferring Child Object to Bitmap
     • Starting point: polygonal object (e.g. text line, word, or glyph)
     • Lossless conversion to a rectangle-based interval representation
     • Transferring the rectangles to the target bitmap
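
One way to picture this step is a scanline conversion: each pixel row covered by the polygon becomes a list of horizontal intervals (rectangles one pixel high), which are then drawn into the bitmap. The sketch below is an assumption about how such a conversion could look, not the authors' implementation. It treats polygon vertices as pixel-corner coordinates, samples pixel centres, and uses the even-odd rule. It reproduces the polygon exactly only when all edges are axis-parallel and lie on integer coordinates.

```python
import math


def polygon_to_intervals(points):
    """Map each covered pixel row y to a list of (x_start, x_end_exclusive) intervals."""
    ys = [p[1] for p in points]
    n = len(points)
    intervals = {}
    for y in range(math.floor(min(ys)), math.ceil(max(ys))):
        yc = y + 0.5                              # sample at the pixel centre
        crossings = []
        for i in range(n):
            (x1, y1), (x2, y2) = points[i], points[(i + 1) % n]
            if (y1 <= yc) != (y2 <= yc):          # edge crosses this scanline
                crossings.append(x1 + (yc - y1) * (x2 - x1) / (y2 - y1))
        crossings.sort()
        row = []
        # Even-odd rule: the polygon interior lies between crossing pairs.
        for a, b in zip(crossings[0::2], crossings[1::2]):
            x_start, x_end = math.ceil(a - 0.5), math.ceil(b - 0.5)
            if x_end > x_start:
                row.append((x_start, x_end))
        if row:
            intervals[y] = row
    return intervals


def draw_intervals(bitmap, intervals):
    """Set every pixel covered by the intervals to foreground (1), clipped to the bitmap."""
    height, width = len(bitmap), len(bitmap[0])
    for y, row in intervals.items():
        if 0 <= y < height:
            for x_start, x_end in row:
                for x in range(max(x_start, 0), min(x_end, width)):
                    bitmap[y][x] = 1


# A hypothetical word outline: a 3 x 2 pixel rectangle with its corner at (1, 1).
word = [(1, 1), (4, 1), (4, 3), (1, 3)]
target = [[0] * 6 for _ in range(4)]
draw_intervals(target, polygon_to_intervals(word))
```

Running the same two calls for every child object of a region fills the target bitmap that the later steps operate on.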
  9. 2 – Smearing Approach
     • Goal: connect all foreground components in the bitmap by filling the gaps in between
     1. Alternately fill horizontal and vertical gaps if they are smaller than a dynamic threshold (the threshold is increased after each iteration)
     2. If necessary, use diagonal smearing to connect the remaining components
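
The first smearing stage can be sketched roughly as follows. This is a simplified illustration under several assumptions that the slides do not specify: the start threshold, the step size, the iteration cap, and the use of 8-connectivity to decide when everything is connected. The diagonal fallback is not implemented here.

```python
def fill_gaps_in_line(line, threshold):
    """Fill background runs shorter than `threshold` that lie between two foreground pixels."""
    out = list(line)
    last_fg = None
    for i, value in enumerate(line):
        if value:
            if last_fg is not None and 0 < i - last_fg - 1 < threshold:
                for j in range(last_fg + 1, i):
                    out[j] = 1
            last_fg = i
    return out


def smear_horizontal(bitmap, threshold):
    return [fill_gaps_in_line(row, threshold) for row in bitmap]


def smear_vertical(bitmap, threshold):
    columns = [fill_gaps_in_line(list(col), threshold) for col in zip(*bitmap)]
    return [list(row) for row in zip(*columns)]


def count_components(bitmap):
    """Count 8-connected foreground components with an iterative flood fill."""
    h, w = len(bitmap), len(bitmap[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if bitmap[y][x] and not seen[y][x]:
                count += 1
                seen[y][x] = True
                stack = [(y, x)]
                while stack:
                    cy, cx = stack.pop()
                    for ny in (cy - 1, cy, cy + 1):
                        for nx in (cx - 1, cx, cx + 1):
                            if (0 <= ny < h and 0 <= nx < w
                                    and bitmap[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                stack.append((ny, nx))
    return count


def smear(bitmap, start_threshold=2, step=1, max_iterations=50):
    """Alternate horizontal and vertical gap filling with a growing threshold."""
    threshold = start_threshold
    for _ in range(max_iterations):
        if count_components(bitmap) <= 1:
            break                                  # everything is connected
        bitmap = smear_horizontal(bitmap, threshold)
        bitmap = smear_vertical(bitmap, threshold)
        threshold += step                          # allow wider gaps next time
    return bitmap


# Two "text lines" of two "words" each, separated by a two-pixel line gap.
lines = [
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
]
smeared = smear(lines)
```

Components that share no pixel row or column are never joined by horizontal or vertical filling alone, so this loop would stop at the iteration cap. That is the case the diagonal smearing in point 2 covers.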
  10. 3 – Subtraction of Neighbours
      • Optional step to avoid overlap with adjacent regions
      • Simply erase the corresponding pixels from the created bitmap
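
Erasing neighbour pixels amounts to a per-pixel mask subtraction. A minimal sketch, assuming each neighbouring region is available as a bitmap of the same size as the region bitmap (the function name and the mask format are hypothetical):

```python
def subtract_neighbours(bitmap, neighbour_masks):
    """Return a copy of `bitmap` with every pixel cleared that is set in any neighbour mask."""
    result = [list(row) for row in bitmap]
    for mask in neighbour_masks:
        for y, mask_row in enumerate(mask):
            for x, occupied in enumerate(mask_row):
                if occupied:
                    result[y][x] = 0
    return result


# The smeared region has spilled into the right-hand column owned by a neighbour.
region = [
    [1, 1, 1],
    [1, 1, 1],
]
neighbour = [
    [0, 0, 1],
    [0, 0, 1],
]
trimmed = subtract_neighbours(region, [neighbour])
```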
  11. 4 – Outline Tracing
      • Trace the contour of the foreground component in the created bitmap
      • Create the polygon on the fly by adding a point for each change of direction (corner)
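
A contour trace of this kind can be sketched by walking along the pixel edges that separate foreground from background and recording a vertex whenever the walking direction changes. The version below is an illustrative assumption rather than the authors' code. It follows only the outer contour of the component containing the top-left-most foreground pixel, ignores holes, and treats diagonally touching pixels as connected.

```python
def trace_outline(bitmap):
    """Return the outer contour as a list of (x, y) pixel-corner vertices, clockwise on screen."""
    h, w = len(bitmap), len(bitmap[0])

    def fg(x, y):
        return 0 <= x < w and 0 <= y < h and bool(bitmap[y][x])

    # For each corner point, collect the directions of the boundary edges leaving it.
    # Edges run clockwise around each foreground pixel (y axis pointing down).
    outgoing = {}
    for y in range(h):
        for x in range(w):
            if not bitmap[y][x]:
                continue
            if not fg(x, y - 1):
                outgoing.setdefault((x, y), set()).add((1, 0))           # top edge
            if not fg(x + 1, y):
                outgoing.setdefault((x + 1, y), set()).add((0, 1))       # right edge
            if not fg(x, y + 1):
                outgoing.setdefault((x + 1, y + 1), set()).add((-1, 0))  # bottom edge
            if not fg(x - 1, y):
                outgoing.setdefault((x, y + 1), set()).add((0, -1))      # left edge
    if not outgoing:
        return []

    # Start at the top-left corner of the top-most, left-most foreground pixel.
    start = min(outgoing, key=lambda p: (p[1], p[0]))
    polygon = []
    vertex, direction, previous = start, (1, 0), (0, -1)
    while True:
        if direction != previous:
            polygon.append(vertex)                # direction changed: a corner
        vertex = (vertex[0] + direction[0], vertex[1] + direction[1])
        if vertex == start:
            return polygon
        previous = direction
        options = outgoing[vertex]
        # Where two pixels touch diagonally, turning left keeps them connected.
        left = (direction[1], -direction[0])
        direction = left if left in options else next(iter(options))


# An L-shaped component of three pixels.
outline = trace_outline([
    [1, 1],
    [1, 0],
])
```

Because vertices are added only at corners, long straight stretches of the contour contribute no points, which keeps the resulting polygon compact.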
  12. Experiments
      • Carried out on a dataset of contemporary documents consisting of scanned magazine and technical article pages
      • Processed with Tesseract OCR 3.02 (open source)
      • Exported to PAGE XML with and without refinement
  13. [Figure: example page with region outlines, original (unrefined) vs. refined]
  14. Results
      • Measurement of region overlaps (number and area)

                           Overlapping Regions   Overlap Area (Megapixel)
      Original Outlines    621 (45.8%)           19.9
      Refined Outlines     286 (21.1%)           2.5
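
The relative reductions implied by these figures can be worked out directly. The snippet below is plain arithmetic on the numbers reported on the slide; the variable names are illustrative.

```python
# Figures as reported on the results slide.
overlapping_regions = {"original": 621, "refined": 286}
overlap_area_mpx = {"original": 19.9, "refined": 2.5}


def relative_reduction(before, after):
    """Reduction from `before` to `after`, in percent of `before`."""
    return 100.0 * (before - after) / before


count_reduction = relative_reduction(
    overlapping_regions["original"], overlapping_regions["refined"])
area_reduction = relative_reduction(
    overlap_area_mpx["original"], overlap_area_mpx["refined"])

print(f"Overlapping regions reduced by {count_reduction:.1f}%")
print(f"Overlap area reduced by {area_reduction:.1f}%")
```

On these numbers the count of overlapping regions falls by roughly 54% and the overlap area by roughly 87%, so the overlaps that remain after refinement are both fewer and much smaller.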
  15. Impact on Performance Evaluation
      • Real-world scenario
      • Measure the performance of the Tesseract OCR engine
      • Evaluation metrics of previous ICDAR page segmentation competitions

      Average success rate using original outlines   81.1%
      Average success rate using refined outlines    84.5%
      Average improvement for all documents           3.4%
      Maximum improvement                            22.9%
  16. Conclusion
      • Existing geometric region data can be significantly refined by fitting precise polygons around child objects
      • Validity and impact on real-world scenarios have been shown
      • Refinement in performance evaluation helps to eliminate problems that arise from insufficient geometric descriptions → concentrate on real issues of OCR methods
      • Positive effect on accuracy of presentation/repurposing systems (highlighting, cropping, article tracking, etc.)
      • Approach used in the Aletheia ground truth editor and result viewer