IMPACT Final Conference - USAL - Text line and word segmentation


Published in: Technology, Education

  1. IMPACT Research: Image Enhancement, Segmentation, Experimental OCR. Apostolos Antonacopoulos, PRImA Lab, The University of Salford, United Kingdom
  2. Outline
       • Overview: digitisation workflow
       • Image enhancement
           • Border removal
           • Page curl removal
           • Correction of arbitrary warping
       • Segmentation
           • Recognition-based
           • Standalone
       • Typewritten document OCR
       • Wordspotting
  3. Overview: Digitisation Workflow
       • Main steps:
           • Scanning
           • Image enhancement
               • Page splitting
               • Border removal
               • Page curl removal
               • Dewarping
           • Layout analysis
               • Segmentation of regions, lines, words and characters
               • Region classification
               • Logical layout analysis
           • OCR (incl. specialist or wordspotting)
           • Post-processing
  4. Textline and Word Segmentation
       • Standalone methods that can be integrated into systems without the need to integrate the FR engine
       • Not based on recognition of characters/words; suitable for documents that contain non-dictionary words or that are not practical to OCR (word spotting)
       • Used in other IMPACT methods:
           • Typewritten OCR
           • Correction of arbitrary warping
           • Word spotting
  5. Hybrid Text Line Segmenter
       • Hybrid approach based on connected component clustering and projection profiles
       • Connected component extraction (incl. noise filtering)
       • Group components into line candidates using an efficient data structure
       • Find and split under-segmented lines using local projection profiles
       • Merge small peripheral lines into the appropriate neighbour (e.g. for i-dots)
       • [Diagram labels: Bitonal image; Text regions (PAGE XML); Regions with text lines (PAGE XML); Parameters]
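
The splitting step on slide 5 ("find and split under-segmented lines using local projection profiles") can be illustrated with a minimal sketch. This is not the IMPACT implementation: the function names, the `valley_ratio` parameter, and the rule "a row is a valley if its ink count is at most a fraction of the peak" are all assumptions made for illustration. The input is a bitonal line candidate given as rows of 0/1 pixels.

```python
from typing import List, Tuple


def horizontal_profile(region: List[List[int]]) -> List[int]:
    """Count foreground (1) pixels in each pixel row of the region."""
    return [sum(row) for row in region]


def split_line_candidate(region: List[List[int]],
                         valley_ratio: float = 0.2) -> List[Tuple[int, int]]:
    """Split a line candidate at low-ink rows of its projection profile.

    Returns (top, bottom) row ranges, bottom exclusive. The valley rule
    (count <= valley_ratio * peak) is an assumed heuristic, not the
    rule used by the original tool.
    """
    profile = horizontal_profile(region)
    peak = max(profile, default=0)
    if peak == 0:
        return []  # no ink at all: nothing to split
    threshold = valley_ratio * peak

    bands = []
    start = None
    for y, count in enumerate(profile):
        if count > threshold and start is None:
            start = y                      # entering a text band
        elif count <= threshold and start is not None:
            bands.append((start, y))       # leaving a text band at a valley
            start = None
    if start is not None:
        bands.append((start, len(profile)))
    return bands


# Two text lines joined by a single touching stroke in row 2.
merged = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
]
print(split_line_candidate(merged))
```

A candidate that yields more than one band would be treated as under-segmented and split at the valleys; a candidate that yields a single band is left alone.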
  6. Density Word Segmenter
       • Adaptive projection-profile based approach using foreground pixel density
       • For each text line:
           • Generate vertical projection profile
           • Find delimiting white spaces using an adaptive threshold based on the density of foreground pixels in the line
           • Group connected components into words
       • [Diagram labels: Bitonal image; Text regions and lines (PAGE XML); Regions, text lines and words (PAGE XML); Parameters]
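
The per-line procedure on slide 6 can be sketched as follows. The slide does not say how the adaptive threshold is derived from the pixel density, so the rule used here (`min_gap = gap_scale * height * (1 - density)`) is purely an assumption, as are the function names and the `gap_scale` parameter. For brevity, runs of inked columns stand in for connected components.

```python
from typing import List, Tuple


def vertical_profile(line: List[List[int]]) -> List[int]:
    """Count foreground (1) pixels in each pixel column of a text line."""
    return [sum(col) for col in zip(*line)]


def segment_words(line: List[List[int]],
                  gap_scale: float = 1.0) -> List[Tuple[int, int]]:
    """Group inked column runs into words, returned as (left, right) ranges.

    The right bound is exclusive. The gap threshold below is an assumed
    density-based rule, not the one used by the original tool.
    """
    height = len(line)
    width = len(line[0]) if height else 0
    if width == 0:
        return []

    profile = vertical_profile(line)
    density = sum(profile) / (width * height)        # share of foreground pixels
    min_gap = gap_scale * height * (1.0 - density)   # assumed adaptive threshold

    # Runs of inked columns (a stand-in for connected components).
    runs = []
    start = None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x
        elif count == 0 and start is not None:
            runs.append((start, x))
            start = None
    if start is not None:
        runs.append((start, width))

    # Merge runs separated by white space narrower than the threshold.
    words = []
    for left, right in runs:
        if words and left - words[-1][1] < min_gap:
            words[-1] = (words[-1][0], right)
        else:
            words.append((left, right))
    return words


# One text line, 4 pixels high: gaps of 1 column inside words, 3 between words.
row = [1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1]
text_line = [row[:] for _ in range(4)]
print(segment_words(text_line))
```

With this input the density is 28/48, so the assumed threshold is about 1.67 columns: the one-column gaps are treated as inter-character spaces and the three-column gap as a word boundary.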
  7. Evaluation
       • Text line ground truth: 25 historical documents (more than 2700 text lines)
       • Results (using USAL layout evaluation tool):
       • Word ground truth: 15 historical documents (more than 14500 words)
       • Results (using USAL layout evaluation tool):
  8. Further Information
       • PRImA
       • IMPACT