Handwritten and Machine Printed Text Separation in Document Images using the Bag of Visual Words Paradigm

In many types of documents, ranging from forms to archive documents and books with annotations, machine-printed and handwritten text may be present in the same document image, giving rise to significant issues within a digitisation and recognition pipeline. It is therefore necessary to separate the two types of text before applying a different recognition methodology to each. In this paper, a new approach is proposed which identifies and separates handwritten from machine-printed text using the Bag of Visual Words (BoVW) paradigm. Initially, blocks of interest are detected in the document image. For each block, a descriptor is calculated based on the BoVW model. The final characterisation of each block as handwritten, machine-printed or noise is made by a Support Vector Machine classifier. The promising performance of the proposed approach is demonstrated using a consistent evaluation methodology which couples meaningful measures with a new dataset.


  1. Handwritten and Machine Printed Text Separation in Document Images using the Bag of Visual Words Paradigm. Konstantinos Zagoris (1,2), Ioannis Pratikakis (2), Apostolos Antonacopoulos (1), Basilis Gatos (3), Nikos Papamarkos (2). (1) Pattern Recognition and Image Analysis (PRImA) Research Lab, School of Computing, Science and Engineering, University of Salford, Greater Manchester, UK; (2) Department of Electrical and Computer Engineering, Democritus University of Thrace, Xanthi, Greece; (3) Institute of Informatics and Telecommunications, National Center for Scientific Research “Demokritos”, Athens, Greece
  2. Current state of the art: three main approaches, operating at the text line level, the word level, or the character level.
  3. Disadvantages of existing approaches: they depend on different page segmentation algorithms and use incompatible feature sets.
  4. Bag of Visual Words Model (BoVWM)
     • Inspired by Information Retrieval Theory
     • An image's content is described by a set of “visual words”
     • A “visual word” is expressed by a group of local features
     • The most well-known local feature is the Scale-Invariant Feature Transform (SIFT)
     • Codebook creation: a codebook is defined by the set of clusters produced from the local features
     • A “visual word” is denoted by the vector which represents the centre of each cluster
     • The codebook is analogous to a dictionary
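The codebook creation described above can be sketched as follows. This is a toy, numpy-only k-means (the slide names k-means as the clustering method); in practice the input would be 128-dimensional SIFT descriptors harvested from training images, not random vectors, and `n_iter` and the initialisation strategy are illustrative assumptions.

```python
import numpy as np

def build_codebook(descriptors, k, n_iter=20, seed=0):
    """Cluster local feature descriptors (e.g. 128-D SIFT vectors) with
    k-means; the k cluster centres are the codebook's "visual words"."""
    rng = np.random.default_rng(seed)
    # Initialise centres with k distinct descriptors chosen at random.
    centres = descriptors[
        rng.choice(len(descriptors), size=k, replace=False)
    ].astype(float)
    for _ in range(n_iter):
        # Assign every descriptor to its nearest centre (Euclidean distance).
        dists = np.linalg.norm(
            descriptors[:, None, :] - centres[None, :, :], axis=2
        )
        labels = dists.argmin(axis=1)
        # Move each centre to the mean of its assigned descriptors.
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):
                centres[j] = members.mean(axis=0)
    return centres
```

The returned `centres` array plays the role of the dictionary: each row is one visual word.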
  5. Bag of Visual Words Model (BoVWM)
     • Each visual entity is described by a BoVWM descriptor
     • Each SIFT point belongs to a “visual word”: the one corresponding to the closest cluster centre under a distance function (Euclidean, Manhattan)
     • The descriptor reflects the frequency with which each visual word appears in the image
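The descriptor construction on this slide reduces to a nearest-centre assignment followed by a frequency histogram. A minimal sketch, assuming Euclidean distance and a codebook of cluster centres (the function name is illustrative):

```python
import numpy as np

def bovw_descriptor(features, codebook):
    """Map each local feature to its nearest visual word (Euclidean
    distance) and return the normalised frequency histogram."""
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    words = dists.argmin(axis=1)  # visual-word index for each feature
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()      # frequency of each visual word
```

Swapping the `np.linalg.norm` call for a Manhattan distance (`np.abs(...).sum(axis=2)`) gives the alternative distance function the slide mentions.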
  6. Proposed method pipeline: Original Image → Page Segmentation → Block Descriptor Extraction (BoVW model) → Classification → Final Result
  7. Page Segmentation
     • Original image → adaptive binarisation method [1] → Adaptive Run Length Smoothing Algorithm [2] → final result
     [1] B. Gatos, I. Pratikakis, and S. Perantonis. Adaptive degraded document image binarization. Pattern Recognition, 39(3):317–327, 2006.
     [2] N. Nikolaou, M. Makridis, B. Gatos, N. Stamatopoulos, and N. Papamarkos. Segmentation of historical machine-printed documents using adaptive run length smoothing and skeleton segmentation paths. Image and Vision Computing, 28(4):590–604, 2010.
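To give an idea of the smoothing step, here is a sketch of classic horizontal Run Length Smoothing on a binary image. Note this is the basic RLSA, not the adaptive variant of [2], whose run-length threshold is chosen per document; the fixed `threshold` parameter here is an assumption for illustration.

```python
import numpy as np

def rlsa_horizontal(binary, threshold):
    """Horizontal Run Length Smoothing: in each row of a binary image
    (1 = foreground), fill any run of background pixels shorter than
    `threshold` that lies between two foreground pixels, so that nearby
    character components merge into word/line blocks."""
    out = binary.copy()
    for row in out:
        fg = np.flatnonzero(row)           # columns of foreground pixels
        if fg.size < 2:
            continue
        for a, b in zip(fg[:-1], fg[1:]):
            if b - a - 1 < threshold:      # background gap shorter than threshold
                row[a:b] = 1               # fill the gap
    return out
```

Running the same smoothing on the transposed image gives the vertical pass; combining both is the usual way RLSA produces candidate text blocks.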
  8. Block Descriptor Extraction
     • This step creates the block descriptor by utilising the BoVW model
     • Codebook properties: it must be small enough to ensure a low computational cost, yet large enough to provide sufficiently high discrimination performance
     • For the clustering stage, the k-means algorithm is employed due to its simplicity and speed
  9. Block Descriptor Extraction: an example text block
     • The SIFT keypoints are calculated on the greyscale version of the block
     • SIFT keypoints whose position in the binary image does not fall on a foreground pixel are rejected
     • Each of the remaining local features is assigned a visual word from the codebook
     • A visual word descriptor is formed based on the occurrences of each codebook visual word in this particular block
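The keypoint rejection step above can be sketched directly. Assumptions: keypoints are (x, y) coordinate pairs as produced by a SIFT detector, and the binarised block uses 1 for foreground; the function name is illustrative.

```python
import numpy as np

def filter_keypoints(keypoints, binary):
    """Keep only keypoints whose (x, y) position falls on a foreground
    pixel (value 1) of the binarised block."""
    kept = []
    for x, y in keypoints:
        r, c = int(round(y)), int(round(x))   # image row = y, column = x
        in_bounds = 0 <= r < binary.shape[0] and 0 <= c < binary.shape[1]
        if in_bounds and binary[r, c] == 1:
            kept.append((x, y))
    return kept
```

The surviving keypoints are the ones fed to the nearest-visual-word assignment that builds the block's descriptor.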
  10. Decision System
     • A classifier decides whether the block contains handwritten text, machine-printed text, or neither of the above (noise)
     • Based on Support Vector Machines (SVMs); conventional approaches: one-against-one, one-against-others
     • Two SVMs with the Radial Basis Function (RBF) kernel are trained: the first (SVM1) separates handwritten text from everything else; the second (SVM2) separates machine-printed text from everything else
  11. Decision System Algorithm
  12. Decision System Algorithm (diagram): the distances D1 and D2 of a sample from the support vectors of SVM1 (handwritten text) and SVM2 (machine-printed text)
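The slides do not spell out the combination rule, so the following is only a plausible sketch: assume each SVM yields a signed decision value (D1 for SVM1, D2 for SVM2; positive means the sample lies on its class's side), reject as noise when both refuse the block, and otherwise take the more confident SVM. The paper's exact rule may differ.

```python
def classify_block(d1, d2):
    """Combine the decision values of SVM1 (handwritten vs. rest) and
    SVM2 (machine-printed vs. rest) into one of three labels.
    Hypothetical rule for illustration: noise if neither SVM accepts
    the block, otherwise the label of the more confident SVM."""
    if d1 <= 0 and d2 <= 0:
        return "noise"
    return "handwritten" if d1 >= d2 else "machine-printed"
```

With real SVMs (e.g. scikit-learn's `SVC` with an RBF kernel), `d1` and `d2` would come from each model's `decision_function`.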
  13. Examples: original images alongside the output of the proposed method
  14. Evaluation Datasets
     • 103 modified document images from the IAM Handwriting Database
     • 100 representative images selected from the index cards of the UK Natural History Museum's card archive (PRImA-NHM)
     • The ground truth files adhere to the Page Analysis and Ground-truth Elements (PAGE) format framework
     • http://datasets.primaresearch.org
  15. Evaluation: the F-measure of each method.
     Method                                                   | IAM    | PRImA-NHM
     Upper bound (proposed segmentation)                      | 0.9887 | 0.7985
     Proposed method (proposed segmentation and BoVW)         | 0.9886 | 0.7689
     Gabor filters (proposed segmentation and Gabor filters)  | 0.7921 | 0.5702
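For reference, the F-measure reported in the table is the harmonic mean of precision and recall:

```python
def f_measure(precision, recall):
    """F-measure (F1): the harmonic mean of precision and recall,
    the score reported per dataset in the evaluation table."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

The harmonic mean penalises imbalance, so a method only scores highly when precision and recall are both high.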
  16. Page Segmentation Problems
     • Binarisation failures
     • Noise overlapping with text
     • Handwritten text overlapping with machine-printed text
  17. Thank You! Ευχαριστώ! (“Thank you!” in Greek)