
A Run Length Smoothing-Based Algorithm for Non-Manhattan Document Segmentation

Layout analysis is a fundamental step in automatic document processing, because its outcome affects all subsequent processing steps. Many different techniques have been proposed to perform this task. In this work, we propose a general bottom-up strategy to tackle the layout analysis of (possibly) non-Manhattan documents, and two specializations of it to handle both bitmap and PS/PDF sources. A well-known approach to layout analysis in the literature is the Run Length Smoothing Algorithm (RLSA). Here we consider a variant of RLSA, called RLSO (short for “Run Length Smoothing with OR”), that uses the logical OR operator instead of AND and is particularly suited to identifying frames in non-Manhattan layouts. Like RLSA, RLSO relies on thresholds, but they are chosen according to different criteria than those used in RLSA. Since setting such thresholds is a hard and unnatural task even for expert users, and no single threshold can fit all documents, we developed a technique to automatically define such thresholds for each specific document, based on the distribution of spacing therein. Application to selected sample documents, which cover a significant range of real cases, showed that the approach is satisfactory for documents that use a uniform text font size.

Published in: Technology
  1. Università degli studi di Bari “Aldo Moro”, Dipartimento di Informatica, L.A.C.A.M.
     A Run Length Smoothing-Based Algorithm for non-Manhattan Document Segmentation
     S. Ferilli, F. Leuzzi, F. Rotella, F. Esposito
     Via Orabona, 4 - 70126 Bari – Italy
     {ferilli, esposito}@di.uniba.it, {fabio.leuzzi, fulvio.rotella}@uniba.it
     http://lacam.di.uniba.it
  2. Introduction
     ● Automatic document processing is a hot topic
       – Layout analysis is a fundamental step
         ● Identification of frames (relevant components in the document)
         ● Performance can determine quality and feasibility of the whole process
     ● Two different…
       ● Kinds of sources: digitized (scanned) vs. natively digital documents
       ● Categories of layouts: Manhattan vs. non-Manhattan
       ● Types of algorithms: top-down vs. bottom-up
     ● Run Length Smoothing Algorithm (RLSA)
       ● Manhattan layout
     ● Other works exploit or try to improve the RLSA by setting its parameters
     ● Many works on Manhattan layout
       – Top-down strategies
     ● Fewer works on non-Manhattan layout
       – Bottom-up strategies
     ● The Manhattan assumption holds for many typeset documents and simplifies document processing, BUT it cannot be assumed in general
  3. RLSO: Application to scanned images
     RLSO (Run Length Smoothing with OR)
     1) horizontal smoothing with threshold th, row by row
     2) vertical smoothing with threshold tv, column by column
     ● logical OR of the images obtained in steps 1 and 2
     [Figure: example with th = 5, tv = 4 (AND)]
  4. RLSO: Application to scanned images
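
The two smoothing passes and their OR combination on a scanned (binary) image can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the names `smooth_line` and `rlso`, the encoding (1 = black, 0 = white), the choice to fill only white runs enclosed by black pixels on both sides, and the non-strict comparison `<=` are all assumptions made for the example.

```python
def smooth_line(line, threshold):
    """Fill white runs (0s) of length <= threshold lying between two black pixels (1s).

    Assumption: runs touching the border of the line are left untouched.
    """
    out = list(line)
    last_black = None
    for i, pixel in enumerate(line):
        if pixel == 1:
            if last_black is not None:
                gap = i - last_black - 1
                if 0 < gap <= threshold:
                    for k in range(last_black + 1, i):
                        out[k] = 1
            last_black = i
    return out


def rlso(image, th, tv):
    """Horizontal pass (rows, th), vertical pass (columns, tv), then pixel-wise OR."""
    horizontal = [smooth_line(row, th) for row in image]
    # Transpose, smooth each column, transpose back.
    smoothed_columns = [smooth_line(col, tv) for col in zip(*image)]
    vertical = [list(row) for row in zip(*smoothed_columns)]
    return [
        [h | v for h, v in zip(h_row, v_row)]
        for h_row, v_row in zip(horizontal, vertical)
    ]


image = [
    [1, 0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0],
]
result = rlso(image, th=2, tv=1)
```

Both passes read the original image, so neither depends on the other's output; the OR keeps every pixel that either pass turned black.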
  5. RLSO: Application to born-digital documents
     ● Set horizontal/vertical distance thresholds $t_h$ / $t_v$
     ● Build a frame for each basic block
     ● $H = \{(d_h, b', b'') \mid b' \text{ and } b'' \text{ are horizontally adjacent basic blocks and } d_h \text{ is the horizontal distance between them}\}$
     ● For all $(d_{h,1}, b'_{h,1}, b''_{h,1}) \in H$ s.t. $d_{h,1} \le t_h$, merge the frames to which $b'_{h,1}$ and $b''_{h,1}$ belong
     ● $V = \{(d_v, b', b'') \mid b' \text{ and } b'' \text{ are vertically adjacent basic blocks and } d_v \text{ is the vertical distance between them}\}$
     ● For all $(d_{v,1}, b'_{v,1}, b''_{v,1}) \in V$ s.t. $d_{v,1} \le t_v$, merge the frames to which $b'_{v,1}$ and $b''_{v,1}$ belong
     [Figure legend: reference block, adjacent blocks, non-adjacent blocks, horizontal distance, vertical distance]
  6. RLSO: Application to born-digital documents
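
The frame-merging procedure for born-digital documents amounts to computing connected groups of blocks, which a union-find structure handles compactly. The sketch below assumes the adjacency sets H and V have already been computed and are given as lists of `(distance, block_a, block_b)` triples with blocks identified by integer indices; the function name `merge_frames` and this input format are hypothetical, and the slides do not say which data structure the authors use.

```python
def merge_frames(n_blocks, H, V, th, tv):
    """Group basic blocks into frames.

    H and V are lists of (distance, block_a, block_b) triples for horizontally
    and vertically adjacent blocks. Two frames are merged whenever a pair of
    their blocks lies at distance <= th (horizontal) or <= tv (vertical).
    """
    parent = list(range(n_blocks))  # initially one frame per basic block

    def find(x):
        # Follow parent links to the representative, halving paths on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for distance, a, b in H:
        if distance <= th:
            union(a, b)
    for distance, a, b in V:
        if distance <= tv:
            union(a, b)

    frames = {}
    for block in range(n_blocks):
        frames.setdefault(find(block), []).append(block)
    return sorted(frames.values())


# Five blocks: 0-1 are close horizontally, 2-3 are close vertically, 4 is isolated.
H = [(3, 0, 1), (20, 1, 2)]
V = [(2, 2, 3), (15, 3, 4)]
frames = merge_frames(5, H, V, th=5, tv=4)
```

Because merging is transitive, the order in which the triples are processed does not change the resulting frames.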
  7. RLSO
     ● Run Length Smoothing algorithms are based on thresholds
       – Hard to set properly by hand (not a typical human activity)
       – Heuristic approaches (ad hoc)
       – Manual setting undermines the idea of automatic processing
       – Fixed thresholds are not suitable for documents with several different spacings
     Automatic assessment of RLSO thresholds
  8. RLSO: Automatic threshold assessment
     ● Study of run length behavior
       – Histogram very irregular
         ● Peaks = most frequent spacings
         ● Peak clusters = equally spaced components
       – Hard to exploit by automatic techniques
       – Cumulative histograms more regular
       – Bar b = runs larger than or equal to b: $H'(i) = \sum_{j \ge i} H(j)$
         ● Monotonically decreasing
       – Flat zones = lengths for which no runs are present
     ● Scaled down to 10%
       – Reduces variability
     [Figure 1. A fragment of a scientific paper]
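
The cumulative histogram $H'(i) = \sum_{j \ge i} H(j)$ can be computed from the white run lengths with a single backward pass. The following is a small sketch under stated assumptions: pixels are encoded as 1 = black and 0 = white, runs touching the border of a line are counted like any other run, and the helper names are invented for the example. The 10% scaling step is left out because the slide does not specify how it is applied.

```python
from collections import Counter


def white_run_lengths(line):
    """Lengths of the maximal runs of white pixels (0s) in one row or column."""
    runs, current = [], 0
    for pixel in line:
        if pixel == 0:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def histogram(lengths):
    """H(i) = number of runs of length exactly i, for i = 0 .. max length."""
    counts = Counter(lengths)
    size = max(counts, default=0) + 1
    return [counts.get(i, 0) for i in range(size)]


def cumulative_histogram(hist):
    """H'(i) = sum of H(j) for j >= i, i.e. the number of runs of length >= i."""
    cumulative = [0] * len(hist)
    total = 0
    for i in range(len(hist) - 1, -1, -1):
        total += hist[i]
        cumulative[i] = total
    return cumulative


rows = [
    [1, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 1, 0, 0],
]
lengths = [n for row in rows for n in white_run_lengths(row)]
H = histogram(lengths)
H_cum = cumulative_histogram(H)
```

In this toy input no run has length 3 or 4, so the cumulative histogram stays constant over those bars: this is the kind of flat zone the slide refers to.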
  9. RLSO: Automatic threshold assessment
     ● Select threshold on flat zones
       – Derivative a good indicator
         ● Slope = 0
         ● Discrete approximation on bar b:
       – Tolerance possible
         ● Slope = –30
       – Skip starting and trailing flat zones
         ● Starting zone = missing small b run lengths
         ● Trailing zone = merge whole content
     ● Iteration of technique on previously smoothed image
       – Finds progressively more spaced components
     [Figures 1-a and 1-b. Successive application of RLSO with automatic threshold assessment on Figure 1.]
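
One possible reading of the flat-zone selection rule is sketched below. Several details are assumptions, since the transcript does not preserve the discrete slope formula: the slope at bar b is taken as the forward difference H'(b+1) - H'(b), with H' treated as 0 beyond the last bar; a bar counts as flat when its slope is no steeper than `-tolerance`; and the returned threshold is the first bar of the first flat zone that is neither the starting nor the trailing one. The function name `select_threshold` is hypothetical.

```python
def select_threshold(cumulative, tolerance=0):
    """Return the first bar of the first interior flat zone, or None if there is none.

    cumulative: list where cumulative[b] is the number of runs of length >= b.
    tolerance:  largest absolute slope still treated as flat.
    """
    extended = list(cumulative) + [0]  # assumed: no runs beyond the last bar
    slopes = [extended[b + 1] - extended[b] for b in range(len(cumulative))]
    flat = [slope >= -tolerance for slope in slopes]

    b = 0
    # Skip the starting flat zone (small run lengths that never occur).
    while b < len(flat) and flat[b]:
        b += 1
    while b < len(flat):
        if flat[b]:
            start = b
            while b < len(flat) and flat[b]:
                b += 1
            # A zone reaching the end is the trailing one: a threshold there
            # would merge the whole content, so it is rejected.
            return start if b < len(flat) else None
        b += 1
    return None


threshold = select_threshold([4, 4, 3, 1, 1, 1])
```

For the iterative variant mentioned on the slide, the same selection would be repeated on the run lengths of the previously smoothed image; the stop criterion is listed as open work in the conclusions, so none is sketched here.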
  10. Sample Evaluation
  11. Conclusions
     ● RLSO (Run Length Smoothing with OR) identifies runs of white pixels in the document image and fills them with black pixels whenever they are shorter than a given threshold
       – Both Manhattan and non-Manhattan layouts
       – Version for natively digital documents
     ● Automatic thresholding is effective on documents having
       – a single character size
       – different spacings
     ● Good baseline towards more complex documents
       – different character sizes
       – graphics
     ● Current and future work
       – Stop criterion for iteration
       – Clustering based on positioning and spacing
