Developing Document Image Retrieval System

K. Zagoris, K. Ergina and N. Papamarkos
Image Processing and Multimedia Laboratory
Department of Electrical & Computer Engineering
Democritus University of Thrace,
67100 Xanthi, Greece

 Phenomenal growth in the volume of multimedia data, especially document images
 Caused by the ease of creating such images with scanners or digital cameras
 Huge quantities of document images are created and stored in image archives without any indexing information

The overall structure of the Document Image Retrieval System

 Original Document
 Median Filter
 Binarization (Otsu Technique)
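
A minimal sketch of this preprocessing stage, assuming 8-bit grayscale input stored as a list of rows and a 3x3 median window. The window size and the names `median_filter3`, `otsu_threshold` and `binarize` are illustrative assumptions, not taken from the system itself.

```python
from statistics import median


def median_filter3(img):
    """3x3 median filter; border pixels reuse the nearest valid neighbour."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            out[y][x] = int(median(window))
    return out


def otsu_threshold(img):
    """Gray level t that maximizes the between-class variance (Otsu)."""
    hist = [0] * 256
    for row in img:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * c for i, c in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = sum0 = 0
    for t in range(256):
        w0 += hist[t]            # pixels with value <= t
        sum0 += t * hist[t]
        w1 = total - w0          # pixels with value > t
        if w0 == 0 or w1 == 0:
            continue
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def binarize(img):
    """Return 1 for dark (text) pixels and 0 for background."""
    t = otsu_threshold(img)
    return [[1 if v <= t else 0 for v in row] for row in img]
```

Applying `binarize(median_filter3(page))` to a grayscale page yields the binary image on which the connected components are then computed.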

Word Segmentation
Using the Connected Components Labeling and Filtering method
 Identify all the Connected Components (CCs)
 Calculate the most common height of the document CCs (CCch)
 Reject the CCs whose height is less than 70% of the CCch. This rejects only punctuation marks and noise.
 Expand the left and right sides of the remaining CCs by 20% of the CCch
 The words are the merged overlapping CCs
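
The filtering and merging steps above can be sketched as follows, assuming the connected components have already been labeled and are available as bounding boxes `(left, top, right, bottom)`. The function name `segment_words` and the rectangle-overlap test are illustrative assumptions.

```python
from collections import Counter


def overlaps(a, b):
    """True if two (left, top, right, bottom) boxes intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def merge(a, b):
    """Smallest box containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def segment_words(boxes):
    """Filter CC boxes by height, widen them, and merge overlaps into words."""
    heights = [bottom - top for (_, top, _, bottom) in boxes]
    cc_ch = Counter(heights).most_common(1)[0][0]   # most common CC height
    kept = [b for b, h in zip(boxes, heights) if h >= 0.7 * cc_ch]
    pad = 0.2 * cc_ch                                # horizontal expansion
    expanded = sorted((l - pad, t, r + pad, b) for (l, t, r, b) in kept)

    words = []
    for box in expanded:
        absorbed = True
        while absorbed:                              # absorb until stable
            absorbed = False
            for w in words:
                if overlaps(w, box):
                    words.remove(w)
                    box = merge(w, box)
                    absorbed = True
                    break
        words.append(box)
    return words
```

With a CCch of 20 pixels, characters closer together than 8 pixels (twice the 20% expansion) end up in the same word box, while wider gaps separate words.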

 Width to Height Ratio
 Word Area Density. The percentage of the black
pixels included in the word-bounding box
 Center of Gravity. The Euclidean distance from the
word’s center of gravity to the upper left corner of the
bounding box:
$$C_x = \frac{M_{(1,0)}}{M_{(0,0)}}, \qquad C_y = \frac{M_{(0,1)}}{M_{(0,0)}}$$

$$M_{(p,q)} = \sum_{x} \sum_{y} \left(\frac{x}{width}\right)^{p} \left(\frac{y}{height}\right)^{q} f(x, y)$$

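A minimal sketch of the three scalar features above, assuming the word image is a binary list of rows cropped to its bounding box, that f(x, y) is 1 for black pixels and 0 otherwise, and that coordinates start at 0 in the upper-left corner. The function name and the returned keys are illustrative.

```python
import math


def basic_features(word):
    """Width/height ratio, area density (%) and center-of-gravity distance."""
    height, width = len(word), len(word[0])

    def moment(p, q):
        # M(p,q) = sum over pixels of (x/width)^p * (y/height)^q * f(x, y)
        return sum(
            (x / width) ** p * (y / height) ** q * word[y][x]
            for y in range(height)
            for x in range(width)
        )

    m00 = moment(0, 0)                      # number of black pixels
    cx = moment(1, 0) / m00
    cy = moment(0, 1) / m00
    return {
        "ratio": width / height,
        "density": 100.0 * m00 / (width * height),
        "cog_distance": math.hypot(cx, cy),  # distance to the upper-left corner
    }
```

Because x and y are divided by the box width and height, the center of gravity is expressed in relative coordinates, so the distance does not depend on the word's size in pixels.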

 Vertical Projection. The first twenty (20) coefficients
of the Discrete Cosine Transform (DCT) of the
smoothed and normalized vertical projection.
[Figure: the original image, its vertical projection, and the smoothed and normalized projection]
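
A sketch of the vertical projection feature under stated assumptions: the slides do not specify how the projection is smoothed and normalized, so a moving average, nearest-neighbour resampling to a fixed length and scaling to a peak of 1 are used here as stand-ins. The DCT is an unnormalized DCT-II and all function names are illustrative.

```python
import math


def vertical_projection(word):
    """Number of black pixels in each column of a binary word image."""
    return [sum(col) for col in zip(*word)]


def smooth(values, radius=1):
    """Moving average; the window shrinks at the borders (assumed method)."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out


def normalize(values, length=64):
    """Resample to a fixed length and scale the peak to 1 (assumed method)."""
    n = len(values)
    resampled = [values[i * n // length] for i in range(length)]
    peak = max(resampled) or 1
    return [v / peak for v in resampled]


def dct_ii(values, count):
    """First `count` coefficients of the unnormalized DCT-II."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(values))
        for k in range(count)
    ]


def vertical_projection_feature(word, count=20):
    """First 20 DCT coefficients of the smoothed, normalized projection."""
    return dct_ii(normalize(smooth(vertical_projection(word))), count)
```

Keeping only the low-order coefficients retains the coarse shape of the projection and discards fine detail, which is where noise and small font differences mostly appear.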

 Top – Bottom Shape Projections. A vector of 50 elements
 The first 25 values are the first 25 coefficients of the DCT of the smoothed and normalized Top Shape Projection
 The remaining 25 values are the first 25 coefficients of the DCT of the smoothed and normalized Bottom Shape Projection
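
The slides do not define the shape projections themselves, so the following sketch rests on an assumption: the top profile of a column is taken as the height of the column measured from its first black pixel down to the bottom of the box, and the bottom profile as the height measured from the top of the box down to its last black pixel. The smoothing and normalization step is omitted for brevity, and all names are illustrative.

```python
import math


def top_profile(word):
    """Per column: rows from the first black pixel down to the box bottom."""
    height = len(word)
    profile = []
    for col in zip(*word):
        first = next((y for y, v in enumerate(col) if v), height)
        profile.append(height - first)
    return profile


def bottom_profile(word):
    """Per column: rows from the box top down to the last black pixel."""
    profile = []
    for col in zip(*word):
        last = max((y for y, v in enumerate(col) if v), default=-1)
        profile.append(last + 1)
    return profile


def dct_ii(values, count):
    """First `count` coefficients of the unnormalized DCT-II."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(values))
        for k in range(count)
    ]


def shape_projection_vector(word, count=25):
    """50-element vector: 25 top-profile plus 25 bottom-profile coefficients."""
    return dct_ii(top_profile(word), count) + dct_ii(bottom_profile(word), count)
```

Under this assumption the top profile traces ascenders and the bottom profile traces descenders, so the two halves of the vector describe the upper and lower outline of the word separately.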

 Upper Grid Features: a ten-element vector of binary values extracted from the upper part of each word image.
 Down Grid Features: a ten-element vector of binary values extracted from the lower part of each word image.

[0,0,0,1195,0,0,0,0,0,0]
[0,0,0,1,0,0,0,0,0,0]
[0,0,0,0,0,0,0,598,50,33]
[0,0,0,0,0,0,0,1,1,0]

The Structure of the Descriptor

 User enters a query word
 The proposed system creates an image of the query
word with font height equal to the average height of all
the word-boxes obtained through the Word
Segmentation stage of the Offline operation.
 For our experimental set the average height is 50
 The font type of the query image is Arial
 Smoothing and normalizing the features described above suppresses small differences between font types

 100 document images were created artificially from various texts
 Gaussian and “Salt and Pepper” noise were then added
 A text search engine was implemented in parallel to make it easier to verify and evaluate the search results of the proposed system
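
For illustration, the two kinds of noise can be simulated as below. The noise parameters used for the test set are not given in the slides, so `sigma`, `density` and `seed` are hypothetical values, and the image is assumed to be 8-bit grayscale stored as a list of rows.

```python
import random


def add_gaussian(img, sigma=10.0, seed=0):
    """Add zero-mean Gaussian noise and clip the result to [0, 255]."""
    rng = random.Random(seed)
    return [
        [min(255, max(0, round(v + rng.gauss(0, sigma)))) for v in row]
        for row in img
    ]


def add_salt_and_pepper(img, density=0.05, seed=0):
    """Set a fraction `density` of pixels to black (0) or white (255)."""
    rng = random.Random(seed)
    out = [row[:] for row in img]            # leave the input untouched
    for row in out:
        for x in range(len(row)):
            r = rng.random()
            if r < density / 2:
                row[x] = 0                   # pepper
            elif r < density:
                row[x] = 255                 # salt
    return out
```

The median filter in the preprocessing stage mainly targets the isolated pixels produced by the second function.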

Implementation
o Visual Studio 2008
o Microsoft .NET Framework 2.0
o C# Language
o Microsoft SQL Server 2005
http://orpheus.ee.duth.gr/irs2_5/

Evaluation
o Precision and Recall metrics
o 30 searches in 100 document images
o Query font: Arial
 Mean Precision: 87.8%
 Mean Recall: 99.26%
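
For reference, the two metrics can be computed per query and then averaged, as in this short sketch. The function names are illustrative, and each query is assumed to be represented by the set of retrieved word occurrences and the set of truly relevant ones.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def mean_precision_recall(queries):
    """Average the per-query metrics over (retrieved, relevant) pairs."""
    scores = [precision_recall(ret, rel) for ret, rel in queries]
    n = len(scores)
    return sum(p for p, _ in scores) / n, sum(r for _, r in scores) / n
```

A high recall with a somewhat lower precision means that almost every occurrence of the query word is found, at the cost of some false matches.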

[Figure panels: FineReader® 9.0 OCR Program; Query Font Name “Tahoma”]

 The query word is given as text and then transformed into a word image
 The proposed system extracts nine (9) powerful features to describe the word images
 These features describe the shape of the words satisfactorily while at the same time suppressing small differences due to noise, size and font type
 In our experiments the proposed system performs better on the same database than a commercial OCR package
