Implementation of Devanagari OCR on Cell Broadband Engine

Technical Report IIITB-OS-2012-10b
April 2013

Team Details

Name                       Roll Number   Contact No
Amulya K (Team Leader)     MT2012017     9880345463
Astha Goyal                MT2012028     9740126090
Kodamasimham Pridhvi       MT2012066     8123160887
Mehak Dhiman               MT2012080     9535942619
Abstract

Devanagari OCR is an Optical Character Recognition system which takes scanned images of Devanagari text as input and converts them to corresponding editable text files. The major modules in the system are segmentation of the image, feature extraction from the segmented image, recognition using these features, and mapping of the segmented image to its corresponding Unicode representation to obtain an editable document. In this project, an open-source English OCR system developed under the GNU public license, called GOCR, has been used. The GOCR system [1] has been modified so that it works for the Devanagari script; the code now converts Devanagari documents, supplied as images, to editable Unicode text.

Processing images and converting them to Unicode takes a lot of processing time when implemented on a single processor. This is where the Cell Broadband Engine (CBE) comes into the picture. It is a multi-core processor with nine microprocessors on which data can be processed in parallel, which reduces the overall processing time compared to a single-processor system where images are processed serially. This has been exploited in this project to process images faster: given a set of images, each image is assigned to a different processor, and the images are passed through the OCR system in parallel. The existing OCR system thus becomes more efficient by reducing the total processing time.
Acknowledgements

We would like to express our thanks to Prof. Shrisha Rao for the support and guidance given to us in completing this project. We would like to extend our thanks to Mr. Jorg Schulenburg, the author of the open-source English OCR code named 'GOCR', which we have used and modified to suit Devanagari. We would also like to thank the authors of the various papers we have referred to.
List of Figures

1  CBE Architecture ............................................. 11
2  Architecture for implementation of OCR on CBE ................ 12
3  Plot of performance of Devanagari OCR on single core vs
   multi core .................................................. 15
4  Plot of no. of images vs time taken in serial mode against
   parallel mode ............................................... 17
List of Tables

1  Comparison of performance of Devanagari OCR on single core
   against multiple cores of the CBE .......................... 15
2  Comparison of no. of images vs time taken in serial mode
   against parallel mode ...................................... 16
Contents

List of Figures                                                3
List of Tables                                                 4
1  Introduction                                                6
2  Problem Statement                                           6
3  Gap Analysis                                                7
4  Literature Survey                                           8
5  Architecture                                                8
   5.1  Optical Character Recognition System (OCR)             8
   5.2  Cell Broadband Engine (CBE)                           10
   5.3  Overall system Architecture: Implementation of OCR
        on CBE                                                12
6  Implementation details                                     13
7  Results                                                    14
   7.1  Recognition rate                                      14
   7.2  Performance of the OCR system on CBE                  14
8  Conclusion and future work                                 17
9  References                                                 18
1 Introduction

Devanagari is an age-old script used for writing many languages, such as Sanskrit, Hindi, Marathi, Punjabi, Konkani and Thula. This has generated a lot of interest in developing new methods of processing documents in Devanagari.

Optical Character Recognition (OCR) is an indispensable part of document image analysis. OCR is the process of converting printed or hand-written text into machine-encoded text, by recognizing the characters in a document image and converting them to a Unicode representation. The output is a parsable document which a machine can interpret in order to perform operations or execute commands based on the text. Applications include helping the blind "read" by converting the Unicode text to speech, intelligent cars that interpret road signs and navigate on their own without human intervention, and intelligent robotics, among many others.

The Cell Broadband Engine Architecture (CBEA) and the Cell Broadband Engine are the result of a joint venture by Sony, Toshiba, and IBM (STI). The Cell Broadband Engine was initially intended for game consoles and media-rich consumer-electronics devices. Implementing an OCR on a CBE brings the advantage of fast processing and recognition, owing to the immense parallel-processing ability of the CBE.

2 Problem Statement

The project is aimed at processing the image of a document to produce machine-encoded text (in Unicode format) as the output. This will be implemented on the Cell processor for printed text, and possibly extended to hand-written text. The challenges which make the problem interesting are the clarity of the document image provided as input, the lighting conditions under which the image was taken, possible degradation of the paper quality due to age, and skew in the scanned documents.
Also, a significant difference between English and Devanagari lies in the fact that the English alphabet has nothing above or below the line, which is not the case in Devanagari. This project caters to all these challenges.

3 Gap Analysis

The following points are noteworthy:

• The high performance provided by the CBE can be exploited for large-data applications. OCR is one such application: it has a large dataset of characters, and sometimes even words, which need to be pre-processed and stored in a database after feature extraction. The parallelization provided by the CBE will make the time taken for this process negligible. OCR has not been implemented on the CBE before, and the absence of any literature on OCR on the CBE is a testimony to the same.

• Devanagari is a script used widely across a large geographical area, yet it has not received the importance that really needs to be attached to it. Research on Devanagari OCR has happened sparsely, with little coordination among the individual research groups, so it is difficult to pinpoint the exact state of the art in this field. The number of IEEE papers published in this field in the years 2011 and 2012 put together is only three. This clearly shows that the area needs to be addressed urgently.

• A lot of research on Devanagari OCR has focused on feature-extraction techniques, and very few papers have been published on segmentation. The literature survey has shown that most OCR implementations suffer from errors in the segmentation stage, usually because of improper segmentation when characters from one line get joined with characters on the line below. This area is addressed in this project.
4 Literature Survey

Literature on OCR can be classified into two groups: segmentation and recognition. The most widely used method for segmentation is that of horizontal and vertical projection profiles [3] of an image. Kompalli et al. [4] have proposed a recognition-driven framework with stochastic models, in which a graph representation is used to segment characters; the paper also discusses the use of linguistics and language models.

As for recognition, neural networks (NN) [5] [6] have been most widely used. Singh et al. [5] have proposed an OCR system based on an artificial NN, where the features include mean distance, a histogram of projection based on spatial position of pixels, and a histogram of projection based on pixel value. [7] describes topographic feature extraction, where the topographic features are closed regions, convexity of strokes, and straight-line strokes. These features are represented as a shape-based graph which acts as an invariant feature set for efficiently discriminating very similar characters. [8] proposes a bilingual recognizer based on PCA followed by support-vector classification for a Hindi-Telugu OCR. A good survey can be found in [9].

5 Architecture

Figure 2 shows the architecture for the implementation of the OCR system on the CBE. This section is in three parts:

• The first part describes the exact steps involved and the algorithms used in the OCR
• The second part describes the CBE architecture
• The third part shows how the OCR system described above is implemented on the CBE

5.1 Optical Character Recognition System (OCR)

An OCR system involves the following four main stages:
1. Pre-processing: Pre-processing involves threshold-value detection and the conversion of the input image into a binary image. It also includes filtering of the image, with modules to increase contrast, clean dust, and remove coffee-mug stains. Removal of serifs (short horizontal strokes glued to the lower end of vertical strokes) and reconnecting barely connected characters are the other pre-processing steps.

Algorithm:

• Binarization of an image is done by choosing a threshold value and converting all intensity values above the threshold to 1 and all values below it to 0. This produces a 'black and white' image, which is the binarized image
• Blind pixels, i.e. dust, are removed using a fill algorithm
• Serifs are removed by classifying horizontal and vertical pixels (comparing the distances to the next vertical and next horizontal white pixels), measuring the mean thickness of vertical and horizontal clusters, and erasing unusually thin horizontal pixels at the bottom line
• A 2x2 filter sets pixels to 0 near barely connected pixels, thereby making them connected

2. Segmentation: Segmentation is the process of splitting the pre-processed image, first into individual lines, then into the words of each line, and finally into the individual characters of each word. This step is required because the characters must be recognized in a later step in order to print them in Unicode format.

Algorithm: The Horizontal Projection Profile (HPP) and Vertical Projection Profile (VPP) are the two most widely used constructs for segmenting a document image. After the image is binarized, each row of the image is summed to obtain its HPP. The HPP dips at the places where the image has white space between two lines, and these dips are taken as the reference points for segmenting lines.
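As a concrete illustration of the binarization and HPP-based line segmentation just described, consider the C sketch below. This is not the GOCR implementation: the fixed 6x6 image size, the polarity convention (0 = ink after thresholding, per the binarization rule above), and the function names are all assumptions of this sketch.

```c
#include <assert.h>

#define H 6
#define W 6

/* Binarize a grayscale image: intensities above the threshold become 1
 * (background), the rest 0 (ink), per the pre-processing rule above. */
static void binarize(const int gray[H][W], int thresh, int bin[H][W]) {
    for (int r = 0; r < H; r++)
        for (int c = 0; c < W; c++)
            bin[r][c] = gray[r][c] > thresh ? 1 : 0;
}

/* Horizontal projection profile: count of ink (0) pixels in each row. */
static void hpp(const int bin[H][W], int profile[H]) {
    for (int r = 0; r < H; r++) {
        profile[r] = 0;
        for (int c = 0; c < W; c++)
            if (bin[r][c] == 0)
                profile[r]++;
    }
}

/* A dip of the profile to zero marks inter-line white space; each maximal
 * run of rows containing ink is reported as one text line. Returns the
 * number of lines; top[i]/bot[i] give the row bounds of line i. */
static int segment_lines(const int profile[H], int top[H], int bot[H]) {
    int n = 0, in_line = 0;
    for (int r = 0; r < H; r++) {
        if (profile[r] > 0 && !in_line) {
            top[n] = r;
            in_line = 1;
        } else if (profile[r] == 0 && in_line) {
            bot[n++] = r - 1;
            in_line = 0;
        }
    }
    if (in_line)
        bot[n++] = H - 1;
    return n;
}
```

The VPP used for word segmentation is the same computation applied column-wise within each detected line.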
Each line is then taken and a VPP is computed on it to find the inter-word spaces (dips in the VPP), from which word segmentation is done. For the individual characters of a word, a connected-component analysis is done, in which every connected entity (each character, in this case) is labelled as a separate object. Every part of the word below the line is output as a separate entity and sent to a different classifier.

3. Feature Extraction: This is the process of extracting specific characteristics of an image (in this case, a character), referred to as its "features". These features are used to separate one character from another during the recognition stage.

Algorithm: The features extracted for recognition are shape-based image invariants and other essential features of a character. The other essential features include the list of longest lines and bows, the number of points and the lines connecting these points to form a character, and the relative positions of these points.

4. Recognition: Training samples are processed initially, and a database of their features is created and stored. Recognition is the process in which each character is labelled. This is done by comparing the features of each 'test' image against the features of every 'trained' image in the database. The class of the image with the maximum match is reported as the match class.

Algorithm: Euclidean distance is used for matching, and a support vector machine (SVM) acts as the classifier.

5.2 Cell Broadband Engine (CBE)

The IBM Cell Broadband Engine has a multi-core architecture designed for high-performance computing and distributed multi-core computing. The architecture features nine microprocessors on a single chip. This multi-core chip, capable of massive floating-point processing, is optimized for compute-intensive workloads, rich broadband media applications, and signal and image processing.
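Before turning to the CBE hardware, the Euclidean-distance matching used in the recognition stage (Section 5.1) can be sketched in C. The fixed feature-vector length, the flat-array database layout, and the function names are assumptions of this sketch; the report's actual feature set and the SVM classifier are not modelled.

```c
#include <assert.h>

#define NFEAT 4  /* feature-vector length: an assumption of this sketch */

/* Squared Euclidean distance between a test feature vector and one
 * trained vector; the square root is unnecessary for ranking matches. */
static double dist2(const double a[NFEAT], const double b[NFEAT]) {
    double d = 0.0;
    for (int i = 0; i < NFEAT; i++) {
        double t = a[i] - b[i];
        d += t * t;
    }
    return d;
}

/* Report the index of the trained vector closest to the test vector:
 * the class with the least Euclidean distance is the match class. */
static int match_class(const double test[NFEAT],
                       const double db[][NFEAT], int ntrain) {
    int best = 0;
    double bestd = dist2(test, db[0]);
    for (int i = 1; i < ntrain; i++) {
        double d = dist2(test, db[i]);
        if (d < bestd) {
            bestd = d;
            best = i;
        }
    }
    return best;
}
```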
Figure 1: CBE Architecture

Figure 1 shows the architecture of the Cell Broadband Engine. The Cell processor can be split into four components:

• External input and output structures
• Power Processing Element (PPE)
• Synergistic Processing Elements (SPEs)
• Element Interconnect Bus (EIB)

The Power Processing Element (PPE), the main processor, controls and coordinates the remaining eight co-processing Synergistic Processing Elements (SPEs). The SPEs are designed to accelerate media and streaming workloads, and their area and power efficiency are important enablers for multi-core designs that take advantage of parallelism in applications. It is on these SPEs that the OCR runs, controlled by the PPE.
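On actual CBE hardware the PPE loads and launches SPE programs through IBM's libspe2 library; as a portable stand-in that can be compiled anywhere, the PPE-dispatches-to-SPEs pattern used in this project (one image per worker, results collected on join) can be sketched with POSIX threads. The names and the placeholder "processing" below are illustrative assumptions, not the project's code.

```c
#include <assert.h>
#include <pthread.h>

#define NWORKERS 6  /* the Sony PS3 used in this report exposes 6 SPUs */

/* One unit of work: in the real system this would be a document image;
 * here, a stand-in integer image id and a slot for the result. */
struct job {
    int image_id;
    int result;
};

/* Worker stand-in for an SPE program: "process" one image. A real SPE
 * would run segmentation and feature extraction on the image data. */
static void *spe_worker(void *arg) {
    struct job *j = (struct job *)arg;
    j->result = j->image_id * 10;  /* placeholder for OCR output */
    return NULL;
}

/* PPE-side dispatch: launch one worker thread per image (n must not
 * exceed NWORKERS), then join and collect, mirroring the thread-per-SPU
 * scheme described in Section 6. */
static void dispatch_all(struct job jobs[], int n) {
    pthread_t tid[NWORKERS];
    for (int i = 0; i < n; i++)
        pthread_create(&tid[i], NULL, spe_worker, &jobs[i]);
    for (int i = 0; i < n; i++)
        pthread_join(tid[i], NULL);
}
```

The parallel timings in Section 7 correspond to this scheme: the total time is governed by the slowest worker rather than the sum over all images.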
5.3 Overall system Architecture: Implementation of OCR on CBE

Figure 2: Architecture for implementation of OCR on CBE

Figure 2 shows the architecture for the implementation of the OCR system on the CBE.

1. Training and testing are the most important steps in an OCR system. Training is a one-time process, in which character samples of the Devanagari script are taken and their features (obtained through a combination of PCA and image-invariant methods) are extracted. Being a one-time process, this step can be performed either on the CBE or on the local Ubuntu machine. The features so obtained are stored in a feature database.

2. During testing, a document image is given as input to the CBE. The PPE instructs one of the SPEs to carry out the segmentation process. Since there are eight SPEs, multiple images can be segmented in the time usually taken to segment one. Segmentation is done by the CBE, and the segmented image is the intermediate output.

3. The PPE takes the segmented words and instructs an SPE to perform feature extraction. The extracted features are the output of this step. The PPE stores them in its memory temporarily; this is required because multiple images are processed at a time.

4. The PPE then fetches the features of an individual image along with the trained features stored in memory. Comparison of the test features with the trained features happens across multiple SPEs. The match values obtained by all the SPEs are compared by the PPE, and the class with the least value (in the case of Euclidean distance) is reported as the match class.

5. The output of the CBE is the matched class. This is used to write the corresponding character in Unicode format, using the Devanagari fonts present in the system.

6 Implementation details

Performance comparison has been done with respect to two aspects:

• For different numbers of lines of input in a single document, the time taken to produce the OCR output using a single core has been compared against the time taken using multiple cores of the CBE. The number of lines has been varied from 5 to 70 in increments of 5.

• For different numbers of input documents, the time taken to produce the OCR output using a single core has been compared against the time taken using multiple cores of the CBE.
The number of input documents given simultaneously to the system has been varied from 1 to 6, so that each image is processed on one SPU (there are 6 available SPUs in the Sony PS3 on which the system has been tested). The time taken to process those documents one by one on a single core has been compared with the time taken to process all of them simultaneously on multiple SPUs.
For a fair comparison of the times obtained, the systems compared must have the same RAM capacity and processor speed. Hence, when reporting the time taken on a single core, only the PPU of the Cell processor has been used to run the OCR system. This represents serial execution: there is only one PPU, so a given set of inputs must be processed one by one. When reporting the time taken on multiple cores, the PPU along with the 6 SPUs of the Sony PS3 has been used. Here the PPU takes in the input and creates multiple threads for the SPUs; up to 6 input documents can be processed simultaneously on the 6 SPUs, and the time taken for such parallel execution has been reported.

7 Results

7.1 Recognition rate

The recognition rate obtained for the GOCR system modified for Devanagari is 87%. The recognition rate has been calculated as the ratio of properly recognised words to the total number of words, measured over 50 documents.

7.2 Performance of the OCR system on CBE

Dependency of time taken on the number of lines of input

Table 1 shows the time taken to produce the OCR output using a single core as against multiple cores of the CBE, for different numbers of lines of input.
Table 1: Comparison of performance of Devanagari OCR on single core against multiple cores of the CBE

No. of Lines   Single Processor (in sec)   Multi Processor (in sec)
 5                81                          74
10               182                         173
15               277                         251
20               365                         333
25               478                         435
30               560                         510
35               694                         630
40              1325                        1200
45              1480                        1340
50              1657                        1500
60              1950                        1640
70              2334                        2130

Figure 3: Plot of performance of Devanagari OCR on single core vs multi core
Figure 3 shows the plot of the time taken on a single core and on multiple cores for different numbers of lines of input. As the number of lines increases, the multi-core CBE performs increasingly better than a single core; that is, the CBE takes much less time than a single-core processor as the input size grows.

Dependency of time taken on the number of input documents run serially vs in parallel

Table 2 shows the time taken to produce the OCR output on a single core as against multiple cores of the CBE, for different numbers of input documents.

Table 2: Comparison of no. of images vs time taken in serial mode against parallel mode

No. of Images   Serial (in sec)   Parallel (in sec)
1                 96                 95
2                191                194
3                287                266
4                382                368
5                478                435
6                575                523
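One way to read Table 2 is as a speedup ratio, serial time divided by parallel time. The helper below is ours, not part of the OCR code; the numbers in the comments come from the table.

```c
#include <assert.h>

/* Speedup of parallel over serial execution: t_serial / t_parallel.
 * Both arguments are measured times in seconds, e.g. from Table 2. */
static double speedup(double t_serial, double t_parallel) {
    return t_serial / t_parallel;
}
```

For one image the gain is negligible (96 s vs 95 s, a ratio of about 1.01), while for six images it rises to roughly 1.10, consistent with the observation that the CBE pays off as the input size grows.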
Figure 4: Plot of no. of images vs time taken in serial mode against parallel mode

Figure 4 shows the plot of the time taken on a single core and on multiple cores for different numbers of input documents. As the number of documents run in parallel increases, the CBE system performs better than a single processor in terms of the time taken.

8 Conclusion and future work

Devanagari OCR has been successfully implemented on a CBE. Since the OCR runs on a multi-core architecture, it is more efficient than on a single-core architecture, and it was observed that the CBE gives better performance for large input sizes than for small ones. In the current implementation, the OCR recognizes scanned images of printed text. More algorithms can be incorporated into the OCR, and the database trained accordingly, to handle handwritten documents and other languages. The algorithms can also be optimized further to give a better recognition rate.

9 References

[1] J. Schulenburg, GOCR.
[2] Michael Kistler, Michael Perrone, Fabrizio Petrini, "CELL Multiprocessor Communication Network: Built for Speed", IEEE Micro, May-June 2006.
[3] Manoj Kumar Shukla, Haider Banka, "An Efficient Segmentation Scheme for the Recognition of Printed Devanagari Script", IJCST, Vol. 2, Issue 4, Oct.-Dec. 2011.
[4] Suryaprakash Kompalli, Srirangaraj Setlur, Venu Govindaraju, "Devanagari OCR using a recognition driven segmentation framework and stochastic language models", IJDAR, Vol. 12, 2009, pp. 123-138, DOI 10.1007/s10032-009-0086-8.
[5] Raghuraj Singh, C. S. Yadav, Prabhat Verma, Vibhash Yadav, "Optical Character Recognition (OCR) for Printed Devnagari Script Using Artificial Neural Network", International Journal of Computer Science & Communication, Vol. 1, No. 1, January-June 2010, pp. 91-95.
[6] Anilkumar N. Holambe, Sushilkumar N. Holambe, Ravinder C. Thool, "Comparative Study of Devanagari Handwritten and Printed Character & Numerals Recognition using Nearest-Neighbor Classifiers", IEEE 2010, 978-1-4244-5540-9.
[7] Soumen Bag, Gaurav Harit, "Topographic Feature Extraction for Bengali and Hindi Character Images", Signal & Image Processing: An International Journal (SIPIJ), Vol. 2, No. 2, June 2011.
[8] C. V. Jawahar, M. N. S. S. K. Pavan Kumar, S. S. Ravi Kiran, "A Bilingual OCR for Hindi-Telugu Documents and its Applications", International Institute of Information Technology, Hyderabad, 2006.
[9] R. Jayadevan, Satish R. Kolhe, Pradeep M. Patil, Umapada Pal, "Offline Recognition of Devanagari Script: A Survey", IEEE Transactions on Systems, Man, and Cybernetics-Part C: Applications and Reviews, Vol. 41, No. 6, November 2011.
[10] U. Pal, T. Wakabayashi, and F. Kimura, "Comparative study of Devanagari handwritten character recognition using different features and classifiers", in Proc. 10th Conf. Document Anal. Recognit., 2009, pp. 1111-1115.
[11] S. Arora, D. Bhattacharjee, M. Nasipuri, D. K. Basu, M. Kundu, and L. Malik, "Study of different features on handwritten Devnagari character", in Proc. 2nd Emerging Trends Eng. Technol., 2009, pp. 929-933.