K. Zagoris, K. Ergina and N. Papamarkos
Image Processing and Multimedia Laboratory
Department of Electrical & Computer Engineering
Democritus University of Thrace,
67100 Xanthi, Greece
 The volume of multimedia data, and of document images in particular, is growing rapidly
 This growth is driven by how easy it is to create such images with scanners or digital cameras
 Huge quantities of document images are created and stored in image archives without any indexing information
The overall structure of the Document Image Retrieval System
 Binarization
(Otsu Technique)
 Original Document
 Median Filter
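The two preprocessing steps named on this slide can be sketched in a few lines. This is a minimal illustration and not the system's own code (which the slides say was written in C#): the function names, the 3x3 filter window, and the choice to filter before binarizing are all assumptions, since the slide does not state the window size or the order of the steps.

```python
def median_filter(img, radius=1):
    """Median-filter a grayscale image given as a list of rows of 0..255 ints.

    The window is (2*radius+1) square and is clipped at the image borders.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = sorted(
                img[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            )
            out[y][x] = window[len(window) // 2]
    return out


def otsu_threshold(img):
    """Return the gray level that maximizes the between-class variance (Otsu)."""
    hist = [0] * 256
    for row in img:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = sum0 = 0
    for t in range(256):
        w0 += hist[t]            # pixels with value <= t
        sum0 += t * hist[t]
        w1 = total - w0          # pixels with value > t
        if w0 == 0 or w1 == 0:
            continue
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(img):
    """Map dark pixels (value <= Otsu threshold) to 1 and the rest to 0."""
    t = otsu_threshold(img)
    return [[1 if v <= t else 0 for v in row] for row in img]
```

A typical call under these assumptions would be `binarize(median_filter(page))`, where `page` is a hypothetical grayscale scan.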
 Identify all the Connected Components (CCs)
 Calculate the most common height of the document CCs (CCch)
 Reject the CCs whose height is less than 70% of CCch. This removes only punctuation marks and noise.
 Expand the left and right sides of the remaining CCs by 20% of CCch
 Merge the overlapping CCs; each merged group is a word
Word Segmentation: using the Connected Components Labeling and Filtering method
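The filtering and merging steps listed above can be sketched on bounding boxes alone. The sketch assumes the connected components have already been labeled and are given as hypothetical `(left, top, right, bottom)` tuples with exclusive right and bottom edges; the function names and the overlap test are illustrative choices, not taken from the system.

```python
from collections import Counter


def overlaps(a, b):
    """True if two (left, top, right, bottom) boxes share some area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def union(a, b):
    """Smallest box that contains both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def segment_words(boxes, min_ratio=0.7, expand_ratio=0.2):
    """Turn connected-component boxes into word boxes."""
    heights = [bottom - top for (_, top, _, bottom) in boxes]
    # CCch: the most common component height in the document.
    cch = Counter(heights).most_common(1)[0][0]
    # Reject components shorter than 70% of CCch (punctuation, noise).
    kept = [box for box, h in zip(boxes, heights) if h >= min_ratio * cch]
    # Expand the left and right sides by 20% of CCch.
    pad = expand_ratio * cch
    expanded = [(l - pad, t, r + pad, b) for (l, t, r, b) in kept]
    # Merge overlapping boxes; each merged group is one word.
    words = []
    for box in sorted(expanded):
        merged = True
        while merged:
            merged = False
            for other in words:
                if overlaps(box, other):
                    words.remove(other)
                    box = union(box, other)
                    merged = True
                    break
        words.append(box)
    return words
```

With character boxes of height 10, two letters 2 pixels apart end up in the same word (each side grows by 2 pixels), while a letter 10 pixels further away starts a new word.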
 Width to Height Ratio
 Word Area Density. The percentage of the black
pixels included in the word-bounding box
 Center of Gravity. The Euclidean distance from the
word’s center of gravity to the upper left corner of the
bounding box:
$$C_x = \frac{M_{(1,0)}}{M_{(0,0)}}, \qquad C_y = \frac{M_{(0,1)}}{M_{(0,0)}}$$
$$M_{(p,q)} = \sum_{x} \sum_{y} \left(\frac{x}{\mathit{width}}\right)^{p} \left(\frac{y}{\mathit{height}}\right)^{q} f(x,y)$$
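The three scalar features on this slide can be computed directly from a binary word image. The sketch below assumes the image is a list of rows with 1 for black pixels, that it contains at least one black pixel, and that the normalized moment is M(p,q) = sum over x, y of (x/width)^p (y/height)^q f(x,y); the function name and the returned dictionary keys are illustrative.

```python
import math


def scalar_features(word):
    """Width-to-height ratio, area density and center-of-gravity distance.

    `word` is a binary image (list of rows, 1 = black) cropped to the
    word-bounding box and containing at least one black pixel.
    """
    height, width = len(word), len(word[0])

    def moment(p, q):
        # Normalized geometric moment of order (p, q).
        return sum(
            (x / width) ** p * (y / height) ** q * word[y][x]
            for y in range(height)
            for x in range(width)
        )

    m00 = moment(0, 0)          # number of black pixels
    cx = moment(1, 0) / m00     # center of gravity, x
    cy = moment(0, 1) / m00     # center of gravity, y
    return {
        "ratio": width / height,
        "density": m00 / (width * height),   # fraction of black pixels
        "cog_distance": math.hypot(cx, cy),  # distance to the upper left corner
    }
```

Because x and y are divided by the box width and height, the center of gravity lies in the unit square and the distance does not depend on the size of the word image.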
 Vertical Projection. The first twenty (20) coefficients
of the Discrete Cosine Transform (DCT) of the
smoothed and normalized vertical projection.
 Original Image
 The Vertical
Projection
 Smoothed and
normalized
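The vertical projection feature can be sketched as follows. The slide does not say how the projection is smoothed or normalized, so the moving-average window and the division by the maximum value are assumptions, as are the function names; the DCT is the standard unscaled DCT-II.

```python
import math


def vertical_projection(word):
    """Number of black pixels in each column of a binary word image."""
    return [sum(column) for column in zip(*word)]


def smooth(signal, radius=2):
    """Moving average with a window clipped at the ends (assumed smoothing)."""
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out


def normalize(signal):
    """Scale the signal so that its maximum is 1 (assumed normalization)."""
    peak = max(signal) or 1
    return [v / peak for v in signal]


def dct_ii(signal, n_coeffs):
    """First n_coeffs coefficients of the unscaled DCT-II of the signal."""
    n = len(signal)
    return [
        sum(v * math.cos(math.pi / n * (i + 0.5) * k) for i, v in enumerate(signal))
        for k in range(n_coeffs)
    ]


def vertical_projection_feature(word, n_coeffs=20):
    """Twenty DCT coefficients of the smoothed, normalized vertical projection."""
    projection = normalize(smooth(vertical_projection(word)))
    return dct_ii(projection, n_coeffs)
```

Keeping only the low-order coefficients retains the coarse shape of the projection and discards fine detail, which is one reason such a feature tolerates small differences between fonts.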
 Top – Bottom Shape Projections. A vector of 50 elements
 The first 25 values are the first 25 coefficients of the DCT of the smoothed and normalized Top Shape Projection
 The remaining 25 values are the first 25 coefficients of the DCT of the smoothed and normalized Bottom Shape Projection
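A possible reading of the 50-element vector is sketched below. The slide does not define the two shape projections, so the definitions used here are assumptions: the top projection of a column is the number of rows from its first black pixel down to the bottom of the box, and the bottom projection is the number of rows from the top of the box down to its last black pixel. Smoothing is omitted for brevity and normalization is a simple division by the maximum.

```python
import math


def top_shape_projection(word):
    """Per column: rows from the first black pixel to the bottom of the box."""
    height = len(word)
    profile = []
    for column in zip(*word):
        rows = [y for y, v in enumerate(column) if v]
        profile.append(height - rows[0] if rows else 0)
    return profile


def bottom_shape_projection(word):
    """Per column: rows from the top of the box to the last black pixel."""
    profile = []
    for column in zip(*word):
        rows = [y for y, v in enumerate(column) if v]
        profile.append(rows[-1] + 1 if rows else 0)
    return profile


def dct_ii(signal, n_coeffs):
    """First n_coeffs coefficients of the unscaled DCT-II of the signal."""
    n = len(signal)
    return [
        sum(v * math.cos(math.pi / n * (i + 0.5) * k) for i, v in enumerate(signal))
        for k in range(n_coeffs)
    ]


def shape_projection_feature(word, n_coeffs=25):
    """25 DCT coefficients of the top projection followed by 25 of the bottom."""
    def normalize(signal):
        peak = max(signal) or 1
        return [v / peak for v in signal]

    top = dct_ii(normalize(top_shape_projection(word)), n_coeffs)
    bottom = dct_ii(normalize(bottom_shape_projection(word)), n_coeffs)
    return top + bottom
```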
 Upper Grid Features: a ten-element vector of binary values extracted from the upper part of each word image.
 Down Grid Features: a ten-element vector of binary values extracted from the lower part of each word image.
[0,0,0,1195,0,0,0,0,0,0]
[0,0,0,1,0,0,0,0,0,0]
[0,0,0,0,0,0,0,598,50,33]
[0,0,0,0,0,0,0,1,1,0]
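The example vectors above look like pixel counts per grid cell followed by their binarized form. The sketch below illustrates that reading only: the slides do not say how the upper and lower parts are delimited or what threshold is applied, so `strip_counts`, `binarize_counts` and the threshold passed to it are hypothetical. The value 40 in the usage note is chosen solely because it lies between 33 and 50, which the example vectors require.

```python
def strip_counts(region, n_strips=10):
    """Count black pixels in n_strips equal-width vertical strips of a region.

    `region` is a binary image (list of rows, 1 = black), for example the
    upper or the lower part of a word image.
    """
    width = len(region[0])
    counts = [0] * n_strips
    for row in region:
        for x, v in enumerate(row):
            counts[x * n_strips // width] += v
    return counts


def binarize_counts(counts, threshold):
    """Set each cell to 1 if its count exceeds the threshold, else 0.

    The threshold is an assumption; the slides do not state how it is chosen.
    """
    return [1 if c > threshold else 0 for c in counts]
```

Under this assumption, `binarize_counts([0,0,0,1195,0,0,0,0,0,0], 40)` and `binarize_counts([0,0,0,0,0,0,0,598,50,33], 40)` reproduce the two binary vectors shown on the slide.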
The Structure of the Descriptor
 User enters a query word
 The proposed system creates an image of the query
word with font height equal to the average height of all
the word-boxes obtained through the Word
Segmentation stage of the Offline operation.
 For our experimental set the average height is 50
 The font type of the query image is Arial
 The smoothing and normalization of the features described above suppress small differences between font types
The Matching Process
 100 document images were created artificially from various texts
 Gaussian and “Salt and Pepper” noise was then added
 A text search engine was implemented in parallel, which makes it easier to verify and evaluate the search results of the proposed system
Implementation
o Visual Studio 2008
o Microsoft .NET Framework 2.0
o C# Language
o Microsoft SQL Server 2005
http://orpheus.ee.duth.gr/irs2_5/
Evaluation
o Precision and Recall metrics
o 30 searches in 100 document images
o Query font: Arial
 Mean Precision: 87.8%
 Mean Recall: 99.26%
FineReader® 9.0 OCR Program
 Mean Precision: 76.67%
 Mean Recall: 58.42%
Query Font Name “Tahoma”
 Mean Precision: 89.44%
 Mean Recall: 88.05%
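For reference, the two metrics reported above can be computed as in the sketch below. The function names and the document identifiers in the usage note are hypothetical; the mean values on the slide would correspond to averaging the per-query results over the 30 searches.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of one query.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def mean_precision_recall(queries):
    """Average precision and recall over (retrieved, relevant) pairs."""
    results = [precision_recall(ret, rel) for ret, rel in queries]
    n = len(results)
    return (sum(p for p, _ in results) / n, sum(r for _, r in results) / n)
```

For example, a query that returns `["d1", "d2", "d3", "d4"]` when the relevant documents are `["d1", "d2", "d5"]` has a precision of 2/4 and a recall of 2/3.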
 The query word is given as text and then transformed into a word image
 The proposed system extracts nine (9) powerful features for the description of the word images
 These features describe the shape of the words satisfactorily while at the same time suppressing small differences due to noise, size and type of fonts
 Based on our experiments, the proposed system performs better on the same database than a commercial OCR package
Developing Document Image Retrieval System
