The presentation shows a simple algorithm for page segmentation based on whitespace analysis. It can be used to locate table or page columns. You can find more information at http://cells.icc.ru
1. AN ALGORITHM
FOR PAGE SEGMENTATION
Alexey O. Shigarov1,2
Roman K. Fedorov1
10th International Conference on
PATTERN RECOGNITION and IMAGE ANALYSIS:
NEW INFORMATION TECHNOLOGIES
St. Petersburg, Russia
December 2010
1 Institute for System Dynamics and Control Theory, SB of RAS
2 e-mail: shigarov@icc.ru
2. Introduction
Page and table segmentation (layout analysis) is a task in
Document Analysis and Recognition (DAR)
Page segmentation (document layout analysis) divides a
document into parts (e.g. columns, figures, tables)
Existing approaches to page segmentation
The first analyzes the text layout (structure),
e.g. using the Voronoi diagram for page
segmentation
The second analyzes the page whitespace,
e.g. using the largest empty rectangle problem
Figure from [Kise K., Sato A., Iwata M. Segmentation of page
images using the area Voronoi diagram // Computer Vision and
Image Understanding. Elsevier Science Inc. 1998. Vol. 70, No. 3. P.
370–382.]
Figure from [Orlowski M. A new algorithm for the largest empty
rectangle problem // Algorithmica. Springer New York. 1990. Vol. 5,
No. 1-4. P. 65–73.]
3. Problem Formulation
Page segmentation includes dividing multi-column text or a table
into columns
Whitespace analysis can be used to detect columns in
multi-column text or a table
Our algorithm detects the whitespace gaps located
between text blocks on a document page
4. Algorithm. Input
Input
A bounding box (rectangle)
• It bounds a page or table
A set of obstacles (rectangles)
• Each obstacle bounds a text block (e.g. a word, several words, or a line)
• Each obstacle is inside the bounding box
• The obstacles don’t overlap each other
The task is to separate the obstacles inside the bounding box by whitespace gaps
The algorithm consists of two steps
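The input described on this slide can be sketched in Python as follows. The names Rect, contains, overlaps and validate_input are hypothetical and do not come from the presentation; the sketch also assumes image coordinates (y grows downward) and treats rectangles that only share an edge as non-overlapping.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle; y is assumed to grow downward (top < bottom)."""
    left: float
    top: float
    right: float
    bottom: float


def contains(outer: Rect, inner: Rect) -> bool:
    """True if `inner` lies completely inside `outer` (edges may touch)."""
    return (outer.left <= inner.left and inner.right <= outer.right
            and outer.top <= inner.top and inner.bottom <= outer.bottom)


def overlaps(a: Rect, b: Rect) -> bool:
    """True if the interiors intersect; a shared edge is not an overlap (assumption)."""
    return (a.left < b.right and b.left < a.right
            and a.top < b.bottom and b.top < a.bottom)


def validate_input(box: Rect, obstacles: list[Rect]) -> None:
    """Check the two input conditions: obstacles inside the box, no overlaps."""
    for o in obstacles:
        if not contains(box, o):
            raise ValueError(f"{o} is not inside the bounding box")
    for a, b in combinations(obstacles, 2):
        if overlaps(a, b):
            raise ValueError(f"{a} overlaps {b}")
```

Calling validate_input(Rect(0, 0, 100, 100), obstacles) returns silently for a valid input and raises ValueError otherwise.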
5. Algorithm. Step 1
For each obstacle
The first line (or rule) is extended up and down from the left bound of the obstacle until
it is stopped by either another obstacle or the bounding box. Each
resulting line is added to the set L1
The second line (or rule) is extended from the right bound of the obstacle in the same
way. Each resulting line is added to the set L2
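A minimal sketch of Step 1 in Python is given below. It is one possible reading of the slide, not the authors' implementation: the names Rect, VLine, extend and step1 are hypothetical, y is assumed to grow downward, and a line is assumed to be stopped only by an obstacle whose interior it would cross (an obstacle whose edge lies exactly on the line does not stop it).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    right: float
    bottom: float


@dataclass(frozen=True)
class VLine:
    """Vertical line at `x` spanning from `top` to `bottom`."""
    x: float
    top: float
    bottom: float


def extend(x: float, source: Rect, obstacles: list[Rect], box: Rect) -> VLine:
    """Extend a vertical line at `x` up and down from `source` until blocked."""
    top, bottom = box.top, box.bottom
    for p in obstacles:
        # Assumption: only obstacles strictly crossed by the line stop it.
        # `source` itself never qualifies because x lies on its own edge.
        if not (p.left < x < p.right):
            continue
        if p.bottom <= source.top:       # blocker above the source obstacle
            top = max(top, p.bottom)
        elif p.top >= source.bottom:     # blocker below the source obstacle
            bottom = min(bottom, p.top)
    return VLine(x, top, bottom)


def step1(box: Rect, obstacles: list[Rect]) -> tuple[set[VLine], set[VLine]]:
    """Return (L1, L2): lines grown from the left and right obstacle bounds."""
    L1 = {extend(o.left, o, obstacles, box) for o in obstacles}
    L2 = {extend(o.right, o, obstacles, box) for o in obstacles}
    return L1, L2
```

Because L1 and L2 are sets, obstacles that are stacked in one column and share the same left or right bound contribute a single line each.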
6. Algorithm. Step 2
Pairs of lines (l1, l2) are formed such that
Either l1 belongs to the set L1 or l1 is the right bound of the bounding box
Either l2 belongs to the set L2 or l2 is the left bound of the bounding box
There are no obstacles between l1 and l2
The top Y-coordinates of l1 and l2 are the same
The bottom Y-coordinates of l1 and l2 are the same
Each such pair of lines (l1, l2)
is a whitespace gap
Algorithm. Output
Output is the set of whitespace gaps
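The pairing conditions of Step 2 can be sketched in Python as below. This is a brute-force illustration under stated assumptions, not the authors' code: the names Rect, VLine, overlaps and find_gaps are hypothetical, l2 is taken to be the left side of a gap and l1 its right side, and "no obstacles between l1 and l2" is read as "the rectangle between the two lines does not intersect the interior of any obstacle". The sketch makes no attempt to match the complexity reported in the presentation.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


class VLine(NamedTuple):
    x: float
    top: float
    bottom: float


def overlaps(a: Rect, b: Rect) -> bool:
    """True if the interiors of a and b intersect (touching edges do not count)."""
    return (a.left < b.right and b.left < a.right
            and a.top < b.bottom and b.top < a.bottom)


def find_gaps(box: Rect, obstacles: list[Rect],
              L1: set[VLine], L2: set[VLine]) -> set[Rect]:
    """Pair lines (l1, l2) into whitespace gaps, each returned as a Rect."""
    # l1 candidates: lines from left obstacle bounds, or the right box bound.
    right_sides = set(L1) | {VLine(box.right, box.top, box.bottom)}
    # l2 candidates: lines from right obstacle bounds, or the left box bound.
    left_sides = set(L2) | {VLine(box.left, box.top, box.bottom)}
    gaps = set()
    for l2 in left_sides:
        for l1 in right_sides:
            if l2.x >= l1.x:
                continue  # assumption: a gap needs positive width
            if (l1.top, l1.bottom) != (l2.top, l2.bottom):
                continue  # top and bottom Y-coordinates must coincide
            gap = Rect(l2.x, l2.top, l1.x, l1.bottom)
            if not any(overlaps(gap, o) for o in obstacles):
                gaps.add(gap)
    return gaps
```

For a page with two stacked blocks on the left, one block on the right and a wide block underneath, the function reports the column gap between the blocks together with the narrow margins next to the wide block.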
7. Using the algorithm for table detection
Text lines are grouped into table regions
Table regions are grouped into tables
8. Using the algorithm for table segmentation
A table can be segmented by recovering its graphical lines (rules)
Vertical lines are recovered from vertical whitespace gaps inside a table
Horizontal lines are recovered from horizontal whitespace gaps inside a table
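One way to turn whitespace gaps into rules is sketched below in Python. The presentation does not say where inside a gap the recovered line is placed or how horizontal gaps are computed, so both choices here are assumptions: the rule is drawn through the middle of the gap, and horizontal gaps are assumed to come from running the same vertical procedure on transposed rectangles. The names Rect, transpose, vertical_rule and horizontal_rule are hypothetical.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def transpose(r: Rect) -> Rect:
    """Swap the x and y axes, so a vertical-gap routine can find horizontal gaps."""
    return Rect(r.top, r.left, r.bottom, r.right)


def vertical_rule(gap: Rect) -> tuple[float, float, float, float]:
    """Segment (x1, y1, x2, y2) through the middle of a vertical gap (assumption)."""
    x = (gap.left + gap.right) / 2
    return (x, gap.top, x, gap.bottom)


def horizontal_rule(gap: Rect) -> tuple[float, float, float, float]:
    """Segment (x1, y1, x2, y2) through the middle of a horizontal gap (assumption)."""
    y = (gap.top + gap.bottom) / 2
    return (gap.left, y, gap.right, y)
```

Transposing the bounding box and the obstacles, detecting vertical gaps, and transposing the resulting gaps back would give horizontal gaps in the original coordinates.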
Conclusion
1. Our algorithm can be used for
1. Multi-column text segmentation
2. Table detection
3. Table segmentation
2. Computational complexity of the algorithm is O(n^2)
3. The algorithm is sufficiently simple to implement
(~60 statements of Object Pascal)