Developing Document Image Retrieval System

A system was developed that can retrieve specific documents from a document collection. In this system the query is given as text by the user and then transformed into an image. Appropriate features were selected in order to capture the general shape of the query and to ignore details due to noise or different fonts. To demonstrate the effectiveness of our system, we used a collection of noisy documents and compared our results with those of a commercial OCR package.

Published in Technology , Art & Photos

Transcript

  • 1. K. Zagoris, K. Ergina and N. Papamarkos. Image Processing and Multimedia Laboratory, Department of Electrical & Computer Engineering, Democritus University of Thrace, 67100 Xanthi, Greece
  • 2. Phenomenal growth in the volume of multimedia data, especially document images. Caused by the ease of creating such images with scanners or digital cameras. Huge quantities of document images are created and stored in image archives without any indexing information.
  • 3. The overall structure of the Document Image Retrieval System
  • 4. Figure panels: Binarization (Otsu technique), Original Document, Median Filter
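The two preprocessing operations named on this slide can be sketched as follows. This is a minimal illustration in pure Python, not the authors' C# implementation. The slide does not state the order of the steps or the filter size, so filtering before binarization, the 3x3 window, and the function names are all assumptions.

```python
def median_filter_3x3(img):
    """Apply a 3x3 median filter to a grayscale image (list of rows, 0-255).

    Border pixels are copied unchanged for simplicity (an assumption).
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = sorted(
                img[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            )
            out[y][x] = window[4]  # middle of the 9 sorted values
    return out


def otsu_threshold(img):
    """Return the gray level that maximizes the between-class variance."""
    hist = [0] * 256
    for row in img:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with gray level <= t
    sum0 = 0  # sum of the gray levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize(img):
    """Map dark pixels (gray level <= Otsu threshold) to 1 and the rest to 0."""
    t = otsu_threshold(img)
    return [[1 if v <= t else 0 for v in row] for row in img]
```

A typical call under these assumptions would be `binarize(median_filter_3x3(page))`, where `page` is a hypothetical grayscale image stored as a list of rows.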
  • 5. Word Segmentation, using the Connected Components Labeling and Filtering method
      • Identify all the Connected Components (CCs).
      • Calculate the most common height of the document CCs (CCch).
      • Reject the CCs with height less than 70% of CCch. This rejects only punctuation marks and noise.
      • Expand the left and right sides of the remaining CCs by 20% of CCch.
      • The words are the merged overlapping CCs.
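The filtering, expansion, and merging steps can be sketched on bounding boxes alone. The code below is an illustrative sketch, assuming the connected components have already been labeled and are given as `(left, top, right, bottom)` tuples; the function names are hypothetical and the overlap test is one plausible reading of "merged overlapping CCs".

```python
from collections import Counter


def overlaps(a, b):
    """True when two (left, top, right, bottom) boxes intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def merge_overlapping(boxes):
    """Repeatedly merge intersecting boxes until none intersect."""
    boxes = list(boxes)
    merged_any = True
    while merged_any:
        merged_any = False
        result = []
        for box in boxes:
            for i, other in enumerate(result):
                if overlaps(box, other):
                    result[i] = (
                        min(box[0], other[0]), min(box[1], other[1]),
                        max(box[2], other[2]), max(box[3], other[3]),
                    )
                    merged_any = True
                    break
            else:
                result.append(box)
        boxes = result
    return boxes


def segment_words(cc_boxes):
    """Turn connected-component boxes into word boxes."""
    heights = [bottom - top for (_, top, _, bottom) in cc_boxes]
    cc_ch = Counter(heights).most_common(1)[0][0]  # most common CC height
    # Reject components shorter than 70% of the most common height.
    kept = [b for b in cc_boxes if (b[3] - b[1]) >= 0.7 * cc_ch]
    # Expand left and right sides by 20% of the most common height.
    pad = 0.2 * cc_ch
    expanded = [(l - pad, t, r + pad, b) for (l, t, r, b) in kept]
    return merge_overlapping(expanded)
```

With five letter boxes of height 10 forming two groups and one small dot of height 2, the dot is rejected and the letters merge into two word boxes.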
  • 6. Width to Height Ratio. Word Area Density: the percentage of black pixels inside the word bounding box. Center of Gravity: the Euclidean distance from the word's center of gravity $(C_x, C_y)$ to the upper-left corner of the bounding box, where $C_x = \frac{M_{(1,0)}}{M_{(0,0)}}$, $C_y = \frac{M_{(0,1)}}{M_{(0,0)}}$ and $M_{(p,q)} = \sum_x \sum_y \left(\frac{x}{width}\right)^p \left(\frac{y}{height}\right)^q f(x,y)$.
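As a rough illustration, the three scalar features can be computed from a binary word image. The sketch assumes the image is a list of rows with 1 for black pixels, that coordinates start at 0 in the upper-left corner, and that moments are normalized by the width and height of the box; the function names are hypothetical. The density is returned as a fraction rather than a percentage.

```python
import math


def moment(word, p, q):
    """Normalized moment: sum of (x/width)^p * (y/height)^q * f(x, y)."""
    height, width = len(word), len(word[0])
    return sum(
        (x / width) ** p * (y / height) ** q * word[y][x]
        for y in range(height)
        for x in range(width)
    )


def scalar_features(word):
    """Return (width/height ratio, area density, center-of-gravity distance).

    Assumes the word image contains at least one black pixel.
    """
    height, width = len(word), len(word[0])
    m00 = moment(word, 0, 0)  # number of black pixels
    ratio = width / height
    density = m00 / (width * height)
    cx = moment(word, 1, 0) / m00
    cy = moment(word, 0, 1) / m00
    # Distance from the center of gravity to the upper-left corner (0, 0).
    return ratio, density, math.hypot(cx, cy)
```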
  • 7. Vertical Projection: the first twenty (20) coefficients of the Discrete Cosine Transform (DCT) of the smoothed and normalized vertical projection. Figure panels: Original Image, The Vertical Projection, Smoothed and normalized.
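A possible way to compute this feature is sketched below. The slides do not say how the projection is smoothed or normalized, so the 3-point moving average, the scaling to a maximum of 1, and the unscaled DCT-II are assumptions, as are the function names. The word image is assumed to be a list of rows with 1 for black pixels.

```python
import math


def vertical_projection(word):
    """Count the black pixels in each column of a binary word image."""
    return [sum(column) for column in zip(*word)]


def smooth_and_normalize(signal):
    """3-point moving average, then scale so the maximum is 1 (assumed scheme)."""
    smoothed = []
    for i in range(len(signal)):
        window = signal[max(0, i - 1): i + 2]
        smoothed.append(sum(window) / len(window))
    peak = max(smoothed)
    return [v / peak for v in smoothed] if peak > 0 else smoothed


def dct_coefficients(signal, count=20):
    """First `count` coefficients of the unscaled DCT-II of the signal."""
    n = len(signal)
    return [
        sum(v * math.cos(math.pi / n * (i + 0.5) * k) for i, v in enumerate(signal))
        for k in range(count)
    ]


def vertical_projection_feature(word, count=20):
    """Chain the three steps into the 20-element feature vector."""
    return dct_coefficients(smooth_and_normalize(vertical_projection(word)), count)
```

Keeping only the low-order DCT coefficients retains the coarse outline of the projection and discards fine detail, which matches the stated goal of ignoring noise and font differences.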
  • 8. Top-Bottom Shape Projections: a vector of 50 elements. The first 25 values are the first 25 DCT coefficients of the smoothed and normalized Top Shape Projection. The remaining 25 values are the first 25 DCT coefficients of the smoothed and normalized Bottom Shape Projection.
  • 9. Upper Grid Features: a ten-element vector of binary values extracted from the upper part of each word image. Down Grid Features: a ten-element vector of binary values extracted from the lower part of the word image.
  • 10. [0,0,0,1195,0,0,0,0,0,0] [0,0,0,1,0,0,0,0,0,0] [0,0,0,0,0,0,0,598,50,33] [0,0,0,0,0,0,0,1,1,0]
  • 11. The Structure of the Descriptor
  • 12. The user enters a query word. The proposed system creates an image of the query word with font height equal to the average height of all the word boxes obtained through the Word Segmentation stage of the offline operation. For our experimental set the average height is 50. The font of the query image is Arial. The smoothing and normalization of the features described before suppress small differences between font types.
  • 13. The Matching Process
  • 14. 100 document images were created artificially from various texts. Gaussian and “Salt and Pepper” noise was then added. A text search engine was implemented in parallel, which makes it easier to verify and evaluate the search results of the proposed system.
  • 15. Implementation: Visual Studio 2008, Microsoft .NET Framework 2.0, C# Language, Microsoft SQL Server 2005. http://orpheus.ee.duth.gr/irs2_5/
  • 16. Evaluation: Precision and Recall metrics; 30 searches in 100 document images; query font: Arial. Mean Precision: 87.8%. Mean Recall: 99.26%.
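For reference, precision and recall for a single query, and their means over a set of queries, can be computed as in the sketch below. The function names and the representation of results as collections of document identifiers are assumptions made for illustration; the numbers on the slide come from the authors' own experiments, not from this code.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant, for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def mean_precision_recall(queries):
    """Average precision and recall over (retrieved, relevant) pairs."""
    pairs = [precision_recall(ret, rel) for ret, rel in queries]
    n = len(pairs)
    return sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n
```

For example, a query that retrieves documents 1 to 4 when documents 2, 3 and 5 are relevant has two hits, giving a precision of 2/4 and a recall of 2/3.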
  • 17. FineReader® 9.0 OCR Program; Query Font Name “Tahoma”. Mean Precision: 76.67%, Mean Recall: 58.42%. Mean Precision: 89.44%, Mean Recall: 88.05%.
  • 18. The query word is given as text and then transformed into a word image. The proposed system extracts nine (9) powerful features to describe the word images. These features describe the shape of the words satisfactorily while suppressing small differences due to noise, size and font type. Based on our experiments, the proposed system performs better on the same database than a commercial OCR package.