AN ALGORITHM
FOR PAGE SEGMENTATION
Alexey O. Shigarov (1, 2)
Roman K. Fedorov (1)
10th International Conference on
PATTERN RECOGNITION and IMAGE ANALYSIS:
NEW INFORMATION TECHNOLOGIES
St. Petersburg, Russia
December 2010
(1) Institute for System Dynamics and Control Theory, SB of RAS
(2) e-mail: shigarov@icc.ru
Introduction
Page and table segmentation (layout analysis) is a task in
Document Analysis and Recognition (DAR)
Page segmentation (document layout analysis) divides a
document into parts (e.g. columns, figures, tables)
Existing approaches to page segmentation
1. Analyze the text layout (structure),
e.g. using the Voronoi diagram for page segmentation
2. Analyze the page whitespace,
e.g. by solving the largest empty rectangle problem
Figure from [Kise K., Sato A., Iwata M. Segmentation of page
images using the area Voronoi diagram // Computer Vision and
Image Understanding. Elsevier Science Inc. 1998. Vol. 70, No. 3. P.
370–382.]
Figure from [Orlowski M. A new algorithm for the largest empty
rectangle problem // Algorithmica. Springer New York. 1990. Vol. 5,
No. 1-4. P. 65–73.]
Problem Formulation
Page segmentation includes dividing multi-column text or a table
into columns
Whitespace analysis can be used to detect columns in
multi-column text or a table
Our algorithm detects the whitespace gaps located
between text blocks on a document page
Algorithm. Input
Input
A bounding box (rectangle)
• It bounds a page or a table
A set of obstacles (rectangles)
• Each obstacle bounds a text block (e.g. a word, several words, a line)
• Each obstacle is inside the bounding box
• The obstacles do not overlap each other
The task is to separate the obstacles inside the bounding box by whitespace gaps
The algorithm consists of two steps
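The input constraints listed above could be checked before running the algorithm. The following is a minimal Python sketch, not the authors' Object Pascal code. It assumes image coordinates (y grows downward) and treats rectangles that merely touch as non-overlapping. The names `Rect`, `contains`, `overlaps` and `is_valid_input` are hypothetical and do not come from the slides.

```python
from itertools import combinations
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle; y grows downward (assumed convention)."""
    left: float
    top: float
    right: float
    bottom: float


def contains(outer, inner):
    """True if `inner` lies entirely inside `outer` (edges may touch)."""
    return (outer.left <= inner.left and inner.right <= outer.right
            and outer.top <= inner.top and inner.bottom <= outer.bottom)


def overlaps(a, b):
    """True if the interiors of `a` and `b` intersect (touching is allowed)."""
    return (a.left < b.right and b.left < a.right
            and a.top < b.bottom and b.top < a.bottom)


def is_valid_input(bbox, obstacles):
    """Check both input constraints: containment and pairwise non-overlap."""
    inside = all(contains(bbox, ob) for ob in obstacles)
    disjoint = not any(overlaps(a, b) for a, b in combinations(obstacles, 2))
    return inside and disjoint


# Illustrative data: two text blocks side by side and one wide block below.
page = Rect(0, 0, 100, 100)
words = [Rect(10, 10, 40, 30), Rect(60, 10, 90, 30), Rect(10, 50, 90, 70)]
print(is_valid_input(page, words))  # True
```

The pairwise check is quadratic in the number of obstacles, which is acceptable for a sanity check on a single page or table.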
Algorithm. Step 1
For each obstacle
The first line (or rule) is extended up and down from the left bound of the obstacle until
it is stopped by another obstacle or by the bounding box. Each
resulting line is added to the set L1
The second line (or rule) is extended from the right bound of the obstacle in the same
way. Each resulting line is added to the set L2
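Step 1 can be sketched in Python as follows. This is one possible reading of the slide, not the authors' implementation. It assumes that y grows downward, that each obstacle edge yields a single line, and that a line is stopped only by an obstacle whose horizontal extent strictly contains the line's x-coordinate, so obstacles with aligned edges do not stop each other's lines. `Rect`, `VLine`, `extend_vertical` and `step1` are hypothetical names.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


class VLine(NamedTuple):
    """Vertical line at `x` spanning from `top` to `bottom`."""
    x: float
    top: float
    bottom: float


def extend_vertical(x, top, bottom, bbox, obstacles):
    """Extend the segment at `x` up and down until an obstacle or the bbox stops it."""
    y_top, y_bottom = bbox.top, bbox.bottom
    for other in obstacles:
        if not (other.left < x < other.right):
            continue  # this obstacle does not cross the line
        if other.bottom <= top:        # obstacle above: limits the upward extension
            y_top = max(y_top, other.bottom)
        elif other.top >= bottom:      # obstacle below: limits the downward extension
            y_bottom = min(y_bottom, other.top)
    return VLine(x, y_top, y_bottom)


def step1(bbox, obstacles):
    """Return (L1, L2): lines grown from the left and right bounds of the obstacles."""
    L1 = {extend_vertical(ob.left, ob.top, ob.bottom, bbox, obstacles)
          for ob in obstacles}
    L2 = {extend_vertical(ob.right, ob.top, ob.bottom, bbox, obstacles)
          for ob in obstacles}
    return L1, L2


# Illustrative data: two text blocks side by side and one wide block below.
bbox = Rect(0, 0, 100, 100)
obstacles = [Rect(10, 10, 40, 30), Rect(60, 10, 90, 30), Rect(10, 50, 90, 70)]
L1, L2 = step1(bbox, obstacles)
```

In this example the lines at x = 40 and x = 60 are stopped at y = 50 by the wide block below, while the lines at x = 10 and x = 90 run the full height of the bounding box because they only touch obstacle edges.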
Algorithm. Step 2
Pairs of lines (l1, l2) are formed such that
• Either the set L1 includes l1, or l1 is the right bound of the bounding box
• Either the set L2 includes l2, or l2 is the left bound of the bounding box
• There are no obstacles between l1 and l2
• The top Y-coordinates of l1 and l2 are the same
• The bottom Y-coordinates of l1 and l2 are the same
Each pair of lines (l1, l2) is a whitespace gap
Algorithm. Output
The output is the set of whitespace gaps
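The pairing conditions of Step 2 can be sketched in Python as below. This is a hedged illustration, not the authors' code. It assumes that l2 forms the left edge of a gap and l1 its right edge (so l2 must lie to the left of l1, a condition the slide does not state explicitly) and that y grows downward. The sets `L1` and `L2` are entered by hand for a three-obstacle example, and `Rect`, `VLine`, `overlaps` and `step2` are hypothetical names.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


class VLine(NamedTuple):
    x: float
    top: float
    bottom: float


def overlaps(a, b):
    """True if the interiors of `a` and `b` intersect (touching is allowed)."""
    return (a.left < b.right and b.left < a.right
            and a.top < b.bottom and b.top < a.bottom)


def step2(bbox, obstacles, L1, L2):
    """Pair lines with equal vertical extent and no obstacle between them."""
    right_edges = set(L1) | {VLine(bbox.right, bbox.top, bbox.bottom)}
    left_edges = set(L2) | {VLine(bbox.left, bbox.top, bbox.bottom)}
    gaps = set()
    for l1 in right_edges:
        for l2 in left_edges:
            if l2.x >= l1.x:
                continue  # assumed: l2 is the left edge of the gap
            if (l1.top, l1.bottom) != (l2.top, l2.bottom):
                continue  # top and bottom Y-coordinates must match
            gap = Rect(l2.x, l1.top, l1.x, l1.bottom)
            if not any(overlaps(gap, ob) for ob in obstacles):
                gaps.add(gap)
    return gaps


# Illustrative data: two text blocks side by side and one wide block below.
bbox = Rect(0, 0, 100, 100)
obstacles = [Rect(10, 10, 40, 30), Rect(60, 10, 90, 30), Rect(10, 50, 90, 70)]
L1 = {VLine(10, 0, 100), VLine(60, 0, 50)}   # lines from left bounds
L2 = {VLine(40, 0, 50), VLine(90, 0, 100)}   # lines from right bounds
gaps = step2(bbox, obstacles, L1, L2)
```

For this data the result is the left margin, the right margin, and the gap between the two upper blocks. The sketch tests every pair of lines against every obstacle, so it is cubic in the worst case; it does not attempt to reach the complexity reported in the conclusion.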
Using the algorithm for table detection
Text lines are grouped into table regions
Table regions are grouped into tables
Using the algorithm for table segmentation
Recovering the graphical lines (rules) of a table can be used for table segmentation
Vertical lines are recovered from vertical whitespace gaps inside a table
Horizontal lines are recovered from horizontal whitespace gaps inside a table
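One way to turn whitespace gaps into table rules is sketched below in Python. The slides do not say where inside a gap a rule is placed, so drawing it along the centre of the gap is an assumption. The `transpose` helper reflects a second assumption: that horizontal gaps could be found by running a vertical-gap procedure on rectangles with x and y swapped. All names (`Rect`, `Segment`, `vertical_rule`, `horizontal_rule`, `transpose`) are hypothetical.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


class Segment(NamedTuple):
    x1: float
    y1: float
    x2: float
    y2: float


def vertical_rule(gap):
    """Vertical rule along the horizontal centre of a vertical gap (assumed placement)."""
    x = (gap.left + gap.right) / 2
    return Segment(x, gap.top, x, gap.bottom)


def horizontal_rule(gap):
    """Horizontal rule along the vertical centre of a horizontal gap (assumed placement)."""
    y = (gap.top + gap.bottom) / 2
    return Segment(gap.left, y, gap.right, y)


def transpose(rect):
    """Swap the x and y axes of a rectangle; applying it twice restores the input."""
    return Rect(rect.top, rect.left, rect.bottom, rect.right)


# Illustrative gaps inside a table.
column_gap = Rect(40, 0, 60, 50)
row_gap = Rect(0, 30, 100, 50)
print(vertical_rule(column_gap))
print(horizontal_rule(row_gap))
```

With `transpose`, a horizontal gap in the original table corresponds to a vertical gap in the transposed one, so a single gap-detection routine could serve both directions.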
Conclusion
1. Our algorithm can be used for
   1. Multi-column text segmentation
   2. Table detection
   3. Table segmentation
2. The computational complexity of the algorithm is O(n^2)
3. The algorithm is simple enough to implement
(~60 statements of Object Pascal)
