IMPACT Final Conference - USAL - Text line and word segmentation

Published in: Technology, Education

  1. IMPACT Research: Image Enhancement, Segmentation, Experimental OCR
       Apostolos Antonacopoulos
       PRImA Lab, The University of Salford, United Kingdom
       www.primaresearch.org
  2. Outline
       • Overview: digitisation workflow
       • Image enhancement
           • Border removal
           • Page curl removal
           • Correction of arbitrary warping
       • Segmentation
           • Recognition-based
           • Standalone
       • Typewritten document OCR
       • Wordspotting
  3. Overview: Digitisation Workflow
       • Main steps:
           • Scanning
           • Image enhancement
               • Page splitting
               • Border removal
               • Page curl removal
               • Dewarping
           • Layout analysis
               • Segmentation of regions, lines, words and characters
               • Region classification
               • Logical layout analysis
           • OCR (incl. specialist or wordspotting)
           • Post-processing
  4. Textline and Word Segmentation
       • Standalone methods that can be integrated into systems without the need to integrate the FR engine
       • Not based on recognition of characters/words: suitable for documents with non-dictionary words or documents that are not practical to OCR (word spotting)
       • Used in other IMPACT methods:
           • Typewritten OCR
           • Correction of arbitrary warping
           • Word spotting
  5. Hybrid Text Line Segmenter
       • Hybrid approach based on connected component clustering and projection profiles
       • Connected component extraction (incl. noise filtering)
       • Group components into line candidates using an efficient data structure
       • Find and split under-segmented lines using local projection profiles
       • Merge small peripheral lines into the appropriate neighbour (e.g. for i-dots)
       Input: bitonal image, text regions (PAGE XML), parameters
       Output: regions with text lines (PAGE XML)
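The connected-component side of slide 5 can be illustrated with a minimal sketch. It is not the PRImA implementation: the slide does not name the "efficient data structure" or any thresholds, so the sorted-list grouping, the `min_area` noise filter, the `ratio` used to decide which lines count as "small", and all function names below are assumptions made for illustration. The projection-profile split of under-segmented lines is left out to keep the example short. The image is assumed to be a list of rows of 0/1 values, with 1 as foreground.

```python
from collections import deque


def connected_components(img, min_area=2):
    """Bounding boxes (top, left, bottom, right) of 8-connected foreground
    components; components with fewer than min_area pixels are dropped
    as noise (min_area is an assumed parameter)."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1 or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right, area = y, x, y, x, 0
            while queue:
                cy, cx = queue.popleft()
                area += 1
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                # Visit the 8-neighbourhood, clipped to the image bounds.
                for ny in range(max(cy - 1, 0), min(cy + 2, h)):
                    for nx in range(max(cx - 1, 0), min(cx + 2, w)):
                        if img[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if area >= min_area:
                boxes.append((top, left, bottom, right))
    return boxes


def group_into_lines(boxes):
    """Group boxes whose row ranges overlap into line candidates.
    Sorting by top edge stands in for the unnamed data structure."""
    lines = []
    for box in sorted(boxes):
        top, _, bottom, _ = box
        if lines and top <= lines[-1]["bottom"]:
            lines[-1]["boxes"].append(box)
            lines[-1]["bottom"] = max(lines[-1]["bottom"], bottom)
        else:
            lines.append({"top": top, "bottom": bottom, "boxes": [box]})
    return lines


def line_height(line):
    return line["bottom"] - line["top"] + 1


def merge_small_lines(lines, ratio=0.5):
    """Merge lines shorter than ratio * median height (e.g. stray i-dots)
    into the vertically nearest regular line. ratio is an assumption."""
    if len(lines) < 2:
        return lines
    heights = sorted(line_height(l) for l in lines)
    median = heights[len(heights) // 2]
    big = [l for l in lines if line_height(l) >= ratio * median]
    small = [l for l in lines if line_height(l) < ratio * median]
    for s in small:
        # Vertical distance between the two row intervals (0 if they overlap).
        target = min(big, key=lambda l: max(l["top"] - s["bottom"],
                                            s["top"] - l["bottom"], 0))
        target["boxes"].extend(s["boxes"])
        target["top"] = min(target["top"], s["top"])
        target["bottom"] = max(target["bottom"], s["bottom"])
    return sorted(big, key=lambda l: l["top"])


def segment_lines(img, min_area=2):
    """Extract components, group them into lines, then merge small lines."""
    return merge_small_lines(group_into_lines(connected_components(img, min_area)))
```

Calling `segment_lines(img)` returns a list of dictionaries, each holding the `top` and `bottom` rows of a line and the component boxes assigned to it. A production segmenter would also need the split step and would write its result back as PAGE XML, neither of which is attempted here.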
  6. Density Word Segmenter
       • Adaptive projection-profile based approach using foreground pixel density
       • For each text line:
           • Generate vertical projection profile
           • Find delimiting white spaces using an adaptive threshold based on the density of foreground pixels in the line
           • Group connected components into words
       Input: bitonal image, text regions and lines (PAGE XML), parameters
       Output: regions, text lines and words (PAGE XML)
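The per-line procedure on slide 6 can be sketched as follows. The slide does not give the threshold formula, so the rule used here (minimum gap width = `gap_factor` × line height × (1 − foreground density)) is purely an assumption for illustration, as are the function names and the `gap_factor` parameter. The sketch returns column ranges for each word rather than grouping connected components; components falling inside a range would belong to that word. A text line is assumed to be a list of rows of 0/1 values, with 1 as foreground.

```python
def vertical_profile(line):
    """Number of foreground pixels in each column of a bitonal text line."""
    return [sum(column) for column in zip(*line)]


def white_gaps(profile):
    """Interior runs of empty columns as (start, end) pairs, end exclusive.
    Leading and trailing margins are not reported as gaps."""
    gaps, start = [], None
    for x, count in enumerate(profile):
        if count == 0 and start is None:
            start = x
        elif count != 0 and start is not None:
            gaps.append((start, x))
            start = None
    # A run still open at the end is the right margin and is ignored;
    # a run starting at column 0 is the left margin.
    return [g for g in gaps if g[0] > 0]


def segment_words(line, gap_factor=0.5):
    """Split a text line into word column ranges (start, end), end exclusive.

    The threshold adapts to the line's foreground density. The formula
    below is a hypothetical stand-in, not the one used by the original tool.
    """
    profile = vertical_profile(line)
    inked = [x for x, count in enumerate(profile) if count > 0]
    if not inked:
        return []
    height = len(line)
    density = sum(profile) / float(height * len(profile))
    min_gap = max(1.0, gap_factor * height * (1.0 - density))

    words, start = [], inked[0]
    for gap_start, gap_end in white_gaps(profile):
        if gap_end - gap_start >= min_gap:
            # Wide enough to separate words: close the current word here.
            words.append((start, gap_start))
            start = gap_end
    words.append((start, inked[-1] + 1))
    return words
```

Narrow gaps (between characters) fall below `min_gap` and are ignored, while wider gaps are treated as word delimiters. How well any single formula separates the two cases depends on the typeface and scan quality, which is presumably why the original method exposes parameters.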
  7. Evaluation
       • Text line ground truth: 25 historical documents (more than 2700 text lines)
       • Results (using USAL layout evaluation tool):
       • Word ground truth: 15 historical documents (more than 14500 words)
       • Results (using USAL layout evaluation tool):
  8. Further Information
       • PRImA
           • http://www.primaresearch.org
       • IMPACT
           • http://www.impact-project.eu