Text extraction using document structure features and support vector machines

Konstantinos Zagoris, Nikos Papamarkos
Image Processing and Multimedia Laboratory
Department of Electrical & Computer Engineering
Democritus University of Thrace
67100 Xanthi, Greece
Email: papamark@ee.duth.gr
http://ipml.ee.duth.gr/~papamark/

 Nowadays, there is an abundance of document images, such as technical articles, business letters, faxes and newspapers, without any indexing information
 For systems such as OCR to exploit them successfully, a text localization technique must be employed

 Bottom Up Techniques
 Top Down Techniques

 The proposed technique is a continuation of the work "PLA using RLSA and a neural network" by C. Strouthopoulos, N. Papamarkos and C. Chamzas
 In the proposed work, the feature set is adaptive
 The feature reduction technique is simpler
 The classifier is an SVM

 Apply preprocessing techniques (binarization, noise reduction)
 Locate, merge and extract blocks
 Extract the features from the blocks
 Find the blocks that contain text using Support Vector Machines
 Locate or extract the text blocks and present them to the user

 The Original Document
 After the Pre-Processing Step
 The Connected Components
 The Expanded Connected Components
 The Final Blocks
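The connected-component stage named in the figures above can be illustrated with a minimal sketch. It assumes a binarized image stored as a list of rows of 0/1 values, with 1 as foreground, and it assumes 8-connectivity; the slides state neither. The function name `connected_component_boxes` is hypothetical, and the later expansion and merging of components into blocks is not covered because the slides do not describe how it is done.

```python
from collections import deque


def connected_component_boxes(image):
    """Return bounding boxes (top, left, bottom, right) of the 8-connected
    foreground regions of a binary image given as a list of rows of 0/1."""
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < height and 0 <= nx < width
                        if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes
```

Each returned box is a candidate region; in the proposed method such regions are subsequently expanded and merged into the final blocks.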

The features are a set of suitable Document Structure Elements (DSEs) that the blocks contain
A DSE is any 3x3 binary block
There are 2^9 = 512 DSEs in total
b8 b7 b6
b5 b4 b3
b2 b1 b0
The Pixel Order of the DSEs
$L_j = \sum_{i=0}^{8} b_{ji} \, 2^{i}$

 
The DSE of L142
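As a small worked example of the DSE numbering, the sketch below converts a 3x3 binary block to its index and back. It assumes the pixel order runs from b8 at the top-left to b0 at the bottom-right, so reading the block row by row gives the binary digits of L from most to least significant. The names `dse_index` and `dse_block` are hypothetical.

```python
def dse_index(block):
    """Return the index L of a 3x3 binary block given as three rows of 0/1.

    Row-major reading order is b8, b7, ..., b0, and L = sum(b_i * 2**i).
    """
    bits = [bit for row in block for bit in row]
    if len(bits) != 9:
        raise ValueError("a DSE must contain exactly 9 pixels")
    return sum(bit << (8 - position) for position, bit in enumerate(bits))


def dse_block(index):
    """Inverse of dse_index: rebuild the 3x3 block for an index in 0..511."""
    if not 0 <= index <= 511:
        raise ValueError("DSE index must lie in 0..511")
    bits = [(index >> (8 - position)) & 1 for position in range(9)]
    return [bits[0:3], bits[3:6], bits[6:9]]
```

Under this ordering, 142 = 128 + 8 + 4 + 2, so the DSE L142 has b7, b3, b2 and b1 set, which `dse_block(142)` returns as the rows `[0, 1, 0]`, `[0, 0, 1]`, `[1, 1, 0]`.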

 The initial descriptor of the block is the histogram
of the DSEs that the block contains
 The length of the initial descriptor is 510
 The L0 and L511 DSEs are removed because they
correspond to pure background and pure
document objects, respectively
 A feature reduction algorithm is applied which
reduces the number of features.
 The selected features are the DSEs that most reliably separate the text blocks from the others.
 We call this feature reduction algorithm Feature
Standard Deviation Analysis of Structure Elements
(FSDASE)
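The initial descriptor described above, a histogram of DSEs with L0 and L511 dropped, might be computed along the following lines. This is a sketch under assumptions: the block is a list of rows of 0/1 values, every overlapping 3x3 window is counted (the slides do not say whether windows overlap), and the counts are left unnormalized. The name `dse_histogram` is hypothetical.

```python
def dse_histogram(block):
    """Count the DSEs in a binary block and return the 510 counts for L1..L510.

    Assumption: every overlapping 3x3 window is one DSE occurrence.
    Entry k of the result holds the count of DSE L(k+1).
    """
    counts = [0] * 512
    if not block:
        return counts[1:511]
    height, width = len(block), len(block[0])
    for r in range(height - 2):
        for c in range(width - 2):
            index = 0
            # Read the window row by row: b8 first, b0 last.
            for dr in range(3):
                for dc in range(3):
                    index = (index << 1) | block[r + dr][c + dc]
            counts[index] += 1
    # Drop L0 (pure background) and L511 (pure document object).
    return counts[1:511]
```

Dividing each count by the number of windows would turn the result into relative frequencies, which makes blocks of different sizes comparable; whether the original method does so is not stated.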

Find the standard deviation SDXT(Ln) over the text blocks for each DSE Ln
Find the standard deviation SDXP(Ln) over the non-text blocks for each DSE Ln
Normalize them to obtain SDXT´(Ln) and SDXP´(Ln)
Then define the vector O(Ln) as
O(Ln) = |SDXT´(Ln) − SDXP´(Ln)|
Finally, keep the 32 DSEs that correspond to the 32 largest values of O(Ln).
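The FSDASE steps above can be sketched as follows. Two details are assumptions, since the slides do not fix them: the population standard deviation is used, and each standard-deviation vector is normalized by dividing by its maximum. The function name `fsdase` and its parameters are hypothetical.

```python
from statistics import pstdev


def fsdase(text_descriptors, nontext_descriptors, keep=32):
    """Return the indices of the `keep` features with the largest
    |normalized SD over text blocks - normalized SD over non-text blocks|."""
    n_features = len(text_descriptors[0])

    def per_feature_sd(descriptors):
        return [pstdev([d[i] for d in descriptors]) for i in range(n_features)]

    def normalize(values):
        # Assumption: scale to [0, 1] by the maximum value.
        largest = max(values)
        return [v / largest if largest > 0 else 0.0 for v in values]

    sd_text = normalize(per_feature_sd(text_descriptors))
    sd_nontext = normalize(per_feature_sd(nontext_descriptors))
    score = [abs(t - p) for t, p in zip(sd_text, sd_nontext)]

    # Rank features by score, highest first, and keep the top ones.
    ranked = sorted(range(n_features), key=lambda i: score[i], reverse=True)
    return sorted(ranked[:keep])
```

The returned indices select which entries of the 510-element initial descriptor are kept; raising `keep` enlarges the final descriptor.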

 The goal of FSDASE is to find the DSEs that have maximum SD over the text blocks and minimum SD over the non-text blocks, or vice versa
 A training dataset is required
 This is not a problem, because such a dataset is already required for training the SVMs
 Therefore the final block descriptor is a vector with 32 elements, corresponding to the frequencies of the 32 selected DSEs in the block

The descriptor can adapt to the demands of each set of document images
A noisy document has a different set of DSEs than a clean document
If more computational power is available, the descriptor can easily grow beyond 32 elements
This descriptor is used to train the Support Vector Machines

 Based on statistical learning theory
 They need training data
 They separate the space in which the training data reside into two classes
 The training data must be linearly separable

 If the training data are not linearly separable (as in our case), they are mapped from the input space to a feature space using the kernel method
 Our experiments showed the Radial Basis Function, exp(−γ‖x − x′‖²), to be the most robust kernel
 The parameters of the SVMs are determined by a cross-validation procedure using a grid search
 The output of the SVM classifies each block as text or non-text
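To make the kernel and the grid search concrete, the sketch below evaluates the Radial Basis Function between two descriptors and enumerates candidate parameter pairs. The squared Euclidean distance is the standard RBF form. The candidate ranges for the penalty C and for γ are illustrative assumptions (exponentially spaced powers of two, a common choice), not values taken from the slides, and the names `rbf_kernel` and `parameter_grid` are hypothetical. Training and cross-validating the SVM itself is left to an SVM library.

```python
import math
from itertools import product


def rbf_kernel(x, y, gamma):
    """Radial Basis Function kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def parameter_grid():
    """Candidate (C, gamma) pairs for a grid search.

    Assumption: illustrative exponentially spaced ranges, not the
    values used in the original experiments.
    """
    c_values = [2.0 ** e for e in range(-5, 16, 2)]
    gamma_values = [2.0 ** e for e in range(-15, 4, 2)]
    return list(product(c_values, gamma_values))
```

A grid search would train one SVM per pair with cross-validation on the training blocks and keep the pair with the best validation accuracy.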

 The Document Image Database from the University of Oulu is employed
 In our experiments we used the set of 48 article documents
 Those document images contain a mixture of text and pictures
 From this database, five images were selected, and the extracted blocks were used to determine the proper DSEs and as training samples for the SVMs
 The overall results are:
Document Images   Blocks   Success Rate
48                25958    98.453%

ORIGINAL IMAGE
THE OUTPUT OF THE PROPOSED METHOD

 A bottom-up text localization technique is proposed
that detects and extracts homogeneous text from
document images
 A Connected Component analysis technique is applied
which detects the objects of the document
 A flexible descriptor is extracted based on structural
elements
 The descriptor can adapt to the demands of each set of document images
 For example, a noisy document has a different set of DSEs than a clean document
 If more computational power is available, the descriptor can easily grow beyond 32 elements
 A trained SVM classifies the objects as text or non-text
 The experimental results are very promising

ΕΥΧΑΡΙΣΤΩ!
THANK YOU!
