An OCR System for recognition of Urdu text in Nastaliq Font

Published in: Technology, Business

  1. 1. An OCR System for recognition of Urdu text in Nastaliq Font By S. Hassan Amin Supervised By Dr. S. Afaq Hussain Faculty of Computer Science & Engineering Ghulam Ishaq Khan Institute of Engineering Sciences & Technology, Topi-Swabi, 2004
  2. 2. Layout ♦ Introduction ♦ Research Scope ♦ Objectives ♦ Optical Character Recognition Steps in OCR ♦ Urdu Writing Characteristics ♦ Cursive Script Recognition Schemes ♦ Methodology Multi-Tier Holistic Approach Multi-Stage Classification Approach ♦ Results and Discussion ♦ Conclusion ♦ Future Directions ♦ References
  3. 3. Introduction ♦ Urdu is the national language of Pakistan and is understood by well over 300 million people around the world. ♦ There is a need to convert the historical database of Urdu literature into electronic form, so that Urdu can prosper in the age of computers. ♦ Urdu text recognition aims to convert scanned Urdu documents automatically into computerized text files.
  4. 4. Research Scope ♦ Paper documents have been the most important means of exchanging information for ages, but this is changing as we move rapidly towards a paperless society. ♦ IBM has estimated that about $250 billion is spent worldwide every year (largely on operator salaries, etc.) on keying in information from paper documents, and that this is the cost of manually capturing information from only 5% of the available documents [1]. ♦ Urdu Text Recognition ♦ Urdu Text Transliteration ♦ Machine Translation
  5. 5. Objectives ♦ The main objective of this research is to build an OCR system for the Urdu language that is effective for the Nastaliq script irrespective of font size and orientation. To achieve this objective, there are a number of sub-goals: • To investigate the problem of Urdu OCR in depth and to propose new and better ways to solve it. • To investigate the use of an appropriate set of features for Urdu OCR. • To establish a database of Urdu ligatures for investigating the problem of Urdu OCR. • To investigate classification methods that can be useful for Urdu OCR.
  6. 6. Optical Character Recognition (OCR) ♦ Character Recognition or Optical Character Recognition (OCR) is the process of converting scanned images of machine-printed or handwritten text (numerals, letters and symbols) into a computer-processable format (such as ASCII or Unicode) [2]. ♦ Offline character recognition is performed after the writing or printing has been completed. ♦ In online character recognition, the computer recognizes characters as they are drawn, using timing information.
  7. 7. Steps in OCR 1. Image Acquisition 2. Preprocessing 3. Segmentation 4. Feature Extraction 5. Classification 6. Post Processing
  8. 8. 1. Image Acquisition ♦ Image acquisition is accomplished by a digitizer, which can be a scanner (offline recognition), a camera, or a tablet digitizer (online recognition).
  9. 9. 2. Preprocessing ♦ Preprocessing involves noise reduction, skew detection, slant normalization, document decomposition, etc. ♦ Slant can be estimated with methods such as the projection method and the chain code method [4]. ♦ The skew angle of a page can be estimated with methods such as the orientation-dependent histogram [3].
  10. 10. 3. Segmentation ♦ Segmentation is the process of dividing an image into regions, each likely to contain a single object or a group of objects of the same type. For instance, an object can be a character on a text page or a line segment in an engineering drawing. ♦ In OCR, the commonly used segmentation algorithms are XY tree decomposition, run-length smearing and the Hough transform.
  11. 11. 4. Feature Extraction ♦ Selecting an appropriate feature extraction method is probably the single most important factor in achieving high recognition performance [5]. ♦ A newcomer to the field faces the challenge of selecting appropriate features for his or her application.
  12. 12. Feature Extraction(Contd) ♦ Some useful feature extraction methods in the field of OCR are :- 1. Geometric Features 2. Structural Features 3. Moment based Features 4. Template Matching 5. Unitary Image Transforms 6. Zoning 7. Contour Profiles 8. Fourier Descriptors
  13. 13. 5. Classification ♦ Classification is the process of identifying each character and assigning to it the correct character class. Two major approaches for classification methods are: 1. Decision theoretic method 2. Structural Methods
  14. 14. 1. Decision theoretic method ♦ These methods are used when the description of the character can be represented numerically as a feature vector. ♦ The principal approaches to decision-theoretic recognition are minimum distance classifiers, statistical classifiers and neural networks.
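A minimum distance classifier, the first of the decision-theoretic approaches named above, can be sketched in a few lines. This is only an illustration: the class names, the two-feature vectors and the use of Euclidean distance to class means are assumptions made for the example and are not taken from the slides.

```python
import math

def class_means(samples):
    """Average the feature vectors of each class. samples: {label: [vector, ...]}."""
    means = {}
    for label, vectors in samples.items():
        dim = len(vectors[0])
        means[label] = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return means

def classify(vector, means):
    """Return the label whose mean vector is closest in Euclidean distance."""
    return min(means, key=lambda label: math.dist(vector, means[label]))

# Hypothetical training data: two features per sample, values invented for illustration.
training = {
    "elongated": [[0.98, 0.18], [0.97, 0.20]],
    "round": [[0.33, 0.95], [0.37, 0.93]],
}
means = class_means(training)
print(classify([0.90, 0.25], means))
```

An unknown feature vector is simply assigned to the class whose mean it lies nearest to, which is why this family of methods needs a purely numeric description of each character.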
  15. 15. 2. Structural Methods ♦ Within the area of the structural recognition, syntactic methods are among the most common approaches. ♦ In Syntactic pattern recognition, measures of similarity based on the relationship between structural components are formulated using grammatical concepts.
  16. 16. 6. Post Processing ♦ Post processing consists of: 1. Grouping 2. Error Detection and Correction
  17. 17. 1. Grouping ♦ The result of plain symbol recognition is a set of individual symbols. ♦ These symbols in themselves usually do not contain enough information. ♦ We would like to associate the individual symbols that belong to the same string with each other, making up words and numbers. ♦ The process of associating symbols into strings is commonly referred to as grouping.
  18. 18. 2. Error Detection and Correction ♦ Along with the grouping of characters, another issue to take care of is the context in which each character appears. ♦ Even the best OCR systems cannot identify every character with 100% accuracy, but these errors may be detected or even corrected by use of context.
  19. 19. Urdu Writing Characteristics ♦ Urdu is written in a cursive script and has evolved from the Arabic, Persian and Turkish languages. ♦ Depending on the source, the Urdu alphabet is counted as having 36, 37, 42, 51 or 53 characters [8]. ♦ The UZT 1.01 standard has 42 characters.
  20. 20. Urdu Writing Characteristics(Contd) Figure : Urdu Character Set UZT 1.01
  21. 21. Urdu Writing Characteristics(Contd)
      Characteristics          | Urdu   | Arabic | Latin | Hebrew | Hindi
      H Justification          | RL     | RL     | LR    | RL     | LR
      V-Justification          | Center | Base   | No    | No     | Top
      Cursive                  | Yes    | Yes    | No    | No     | Yes
      Diacritics               | Yes    | Yes    | No    | No     | Yes
      # Vowels                 | 2      | 2      | 5     | 11     | -
      # Letters                | 37     | 28     | 26    | 22     | 40
      Letter Shapes            | 1-28   | 1-4    | 2     | 1      | 1
      Complementary Characters | 5      | 3      | -     | -      | -
  22. 22. Cursive Script Recognition Schemes ♦ Two strategies have been applied to cursive script recognition. As described by Amin and Khorsheed [6,7], they can be categorized as follows: 1. Holistic strategies, in which recognition is performed globally on the whole representation of a word, with no attempt to identify characters individually.
  23. 23. Cursive Script Recognition Schemes(Contd) 2. Analytical strategies, in which words are not considered as a whole but as sequences of smaller units. Recognition is performed not directly at word level but at an intermediate level dealing with these units, which can be graphemes, segments, pseudo-letters, etc.
  24. 24. Research Methodology ♦ Two approaches to recognizing Urdu ligatures printed in the Nastaliq script are presented. Both approaches are holistic in nature. They are tested on the identification of a set of the most frequent ligatures printed in Noori Nastaliq. The suggested approaches are: 1. Multi-Tier Holistic Approach 2. Multi-Stage Classification Approach
  25. 25. Multi-Tier Holistic Approach to Urdu Nastaliq Recognition ♦ A multi-tier Holistic Approach using feed forward back propagation neural network was implemented[12].
  26. 26. (Contd) Figure :Multi-Tier Holistic Approach to Urdu Nastaliq Recognition
  27. 27. 1. Segmentation ♦ Connected Component Labeling is applied to the image of Urdu text. ♦ This technique assigns to each connected component of binary image a distinct label. ♦ The labels are usually natural numbers from 1 to the number of connected components in the input image. ♦ The algorithm scans the image from left-to-right and top-to-bottom.
  28. 28. Segmentation(Contd) ♦ On the first line containing black pixels, a unique label is assigned to each contiguous run of black pixels. ♦ For each subsequent black pixel, the pixels in its eight-neighborhood are examined. If any of them has already been labeled, the same label is assigned to the current pixel; otherwise a new label is assigned to it. The procedure continues to the bottom of the image.
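The raster-scan labeling described on the two segmentation slides can be sketched as follows. The slides do not say how two different labels that later turn out to belong to the same component are merged, so this sketch adds the usual two-pass union-find step as an assumption; the function name and the list-of-lists image format are likewise chosen only for illustration.

```python
def label_components(image):
    """Label 8-connected components of a binary image (1 = black pixel).

    Returns (labels, count): labels has the same shape as image, with 0 for
    background and 1..count for the components.
    """
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    parent = [0]  # union-find forest over provisional labels; index 0 is unused

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Pass 1: scan left-to-right, top-to-bottom. Only the four neighbours that
    # were already visited (W, NW, N, NE) can carry a label at this point.
    for r in range(rows):
        for c in range(cols):
            if not image[r][c]:
                continue
            neighbours = []
            for dr, dc in ((0, -1), (-1, -1), (-1, 0), (-1, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and labels[rr][cc]:
                    neighbours.append(labels[rr][cc])
            if not neighbours:
                parent.append(len(parent))      # create a new provisional label
                labels[r][c] = len(parent) - 1
            else:
                roots = [find(n) for n in neighbours]
                smallest = min(roots)
                labels[r][c] = smallest
                for root in roots:
                    parent[root] = smallest     # record that these labels are equivalent

    # Pass 2: replace provisional labels by consecutive final labels.
    final = {}
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                root = find(labels[r][c])
                labels[r][c] = final.setdefault(root, len(final) + 1)
    return labels, len(final)

image = [
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]
labels, count = label_components(image)
```

Each labeled component then corresponds to one candidate ligature or diacritic mark in the text image.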
  29. 29. Feature Extraction I ♦ In this stage, we extract features that help in the recognition of special ligatures (see figure). These features are solidity, number of holes, axis ratio, eccentricity, moments, normalized segment length, curvature, and the ratio of bounding box width to height.
  30. 30. Special Ligature Identification ♦ A feed-forward BPN network is trained on the feature vectors obtained in the Feature Extraction I stage. During testing, this network is used to identify an input ligature as one of the special ligatures. If no valid output is returned, the ligature is treated as a base ligature.
  31. 31. Feature Extraction II ♦ In this stage, special ligatures are associated with base ligatures. A special ligature is associated with the base ligature whose centroid-to-centroid distance from it is minimum. ♦ A number of lines are grown from the center of each special ligature; when one of these lines touches a base ligature, the special ligature is associated with that base ligature. ♦ As a result of this association, twenty new features are added to the feature vector of the base ligature.
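The centroid-to-centroid rule for attaching special ligatures to base ligatures might look like the sketch below. It covers only the minimum-distance rule, not the line-growing step, and it assumes (for illustration) that each component is available as a list of (row, column) pixel coordinates; the function names are hypothetical.

```python
import math

def centroid(pixels):
    """Mean (row, col) of a component given as a list of pixel coordinates."""
    n = len(pixels)
    return (sum(r for r, _ in pixels) / n, sum(c for _, c in pixels) / n)

def associate(special_components, base_components):
    """Map each special-ligature index to the index of the nearest base ligature."""
    base_centroids = [centroid(p) for p in base_components]
    pairs = {}
    for i, pixels in enumerate(special_components):
        c = centroid(pixels)
        pairs[i] = min(range(len(base_centroids)),
                       key=lambda j: math.dist(c, base_centroids[j]))
    return pairs

# Hypothetical components: two base ligatures on one text line, one mark near each.
bases = [[(5, 0), (5, 1), (5, 2)], [(5, 10), (5, 11), (5, 12)]]
specials = [[(2, 1)], [(8, 10), (8, 11)]]
print(associate(specials, bases))
```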
  32. 32. Classification and Recognition ♦ In this stage, the final feature vector of 34 features is fed into a feed-forward backpropagation neural network. The network architecture consists of 34 inputs, 65 hidden neurons and 45 output neurons.
  33. 33. Multi-Stage Classification Approach to Urdu Text Recognition ♦ The motivation behind this approach is the belief that classification performance can be improved by combining multiple classifiers [9,10,11].
  34. 34. (Contd) ♦ As shown in the figure, the first three stages are similar to the multi-tier approach. ♦ Intermediate Classification: In the training phase, we train a competitive network on the feature vectors of base ligatures to divide the input data into the desired number of clusters. An LVQ/BPN network is then trained on the output of the competitive network to classify an input pattern into a particular class or cluster. In the testing phase, the input feature vector is presented to the trained LVQ/BPN network, which returns the class/cluster.
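To make the 34-65-45 architecture concrete, the sketch below builds a network of that shape and runs one forward pass. Only the layer sizes come from the slide. The sigmoid activation, the random weight range and the all-zero placeholder input are assumptions, and the backpropagation training itself is omitted.

```python
import math
import random

LAYER_SIZES = [34, 65, 45]  # inputs, hidden neurons, output neurons (from the slide)

def init_network(sizes, seed=0):
    """Create one (weights, biases) pair per layer with small random values."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        biases = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]
        layers.append((weights, biases))
    return layers

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(layers, features):
    """Propagate a feature vector through every layer and return the output activations."""
    activations = features
    for weights, biases in layers:
        activations = [sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
                       for row, b in zip(weights, biases)]
    return activations

network = init_network(LAYER_SIZES)
outputs = forward(network, [0.0] * 34)  # placeholder feature vector, not real data
predicted_class = max(range(len(outputs)), key=outputs.__getitem__)
n_parameters = sum(len(w) * len(w[0]) + len(b) for w, b in network)
```

With these layer sizes the network has 34×65 + 65 + 65×45 + 45 = 5245 trainable weights and biases, and the index of the strongest of the 45 outputs is read as the recognized ligature class.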
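The clustering half of the intermediate classification stage can be illustrated with plain winner-take-all competitive learning, in which only the prototype closest to each input is moved towards it. This is a simplified stand-in: the slides do not give the competitive network's update rule or parameters, so the learning rate, epoch count, initialization and the two-dimensional toy data below are all assumptions, and the subsequent LVQ/BPN training is not shown.

```python
import math
import random

def train_competitive(vectors, n_clusters, epochs=50, lr=0.1, seed=0):
    """Winner-take-all competitive learning: only the closest prototype moves."""
    rng = random.Random(seed)
    prototypes = [list(v) for v in rng.sample(vectors, n_clusters)]
    for _ in range(epochs):
        for v in vectors:
            winner = min(prototypes, key=lambda p: math.dist(p, v))
            for i, x in enumerate(v):
                winner[i] += lr * (x - winner[i])  # pull the winner towards the input
    return prototypes

def assign_cluster(vector, prototypes):
    """Index of the prototype nearest to the vector."""
    return min(range(len(prototypes)), key=lambda k: math.dist(prototypes[k], vector))

# Toy feature vectors forming two well-separated groups (invented for illustration).
data = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
prototypes = train_competitive(data, n_clusters=2)
clusters = [assign_cluster(v, prototypes) for v in data]
```

The cluster index found this way plays the role of the class label on which the second-stage LVQ/BPN network would be trained.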
  35. 35. (Contd) ♦ Ligature Identification A BPN network is trained for all the ligatures belonging to a particular class/cluster in the classification and recognition stage of the system.
  36. 36. Results and Discussion
  37. 37. Frequency Analysis ♦ To establish a database of Urdu images for training and testing, it was decided to identify the most frequent Urdu ligatures from the World Wide Web. ♦ This was a challenge, since most Urdu sites are based on images of Urdu text, so there was no way of counting Urdu ligatures without first recognizing them. ♦ The BBC Urdu news site http://www.bbc.co.uk/urdu/ was selected for frequency analysis because it is a font-based Urdu site. ♦ The hex codes of the BBC Urdu font were studied. ♦ A study of the Urdu font was also carried out. There are three types of Urdu characters: 1. Characters which do not connect on both sides, e.g. alif 2. Characters which connect on both sides, e.g. bay, tay 3. Characters which do not connect from the left, e.g. wow, ray ♦ There are two types of breaks in an Urdu text file: a hard break, identified by 0x0020, and a soft break, identified by the nature of the character. On the basis of these breaks and punctuation marks, we decide where ligatures are separated and hence keep a count of ligatures.
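The counting rule described above (hard breaks at spaces and punctuation, soft breaks after letters that do not join to the following letter) can be sketched as below. The specific set of non-joining letters and the punctuation set are assumptions based on standard Urdu orthography, not lists taken from the slides, and diacritics are not handled.

```python
from collections import Counter

# Letters assumed not to join to the following letter, so they close a ligature:
# alif, alif madda, dal, ddal, zal, ray, rray, zay, zhay, wow, bari yeh.
NON_JOINING_TO_NEXT = set("\u0627\u0622\u062f\u0688\u0630\u0631\u0691\u0632\u0698\u0648\u06d2")
# Hard breaks: whitespace (including 0x0020) plus Urdu and Latin punctuation.
HARD_BREAKS = set(" \t\n\u060c\u06d4\u061f!.,:;")

def split_ligatures(text):
    """Split text into ligatures at hard breaks and after non-joining letters."""
    ligatures, current = [], ""
    for ch in text:
        if ch in HARD_BREAKS:
            if current:
                ligatures.append(current)
            current = ""
            continue
        current += ch
        if ch in NON_JOINING_TO_NEXT:  # soft break: this letter ends the ligature
            ligatures.append(current)
            current = ""
    if current:
        ligatures.append(current)
    return ligatures

def count_ligatures(text):
    """Frequency of each ligature in the text."""
    return Counter(split_ligatures(text))

# Sample: kaf + bari yeh, heh + bari yeh, kaf + bari yeh, separated by spaces.
sample = "\u06a9\u06d2 \u06c1\u06d2 \u06a9\u06d2"
counts = count_ligatures(sample)
```

Running a counter of this kind over a large font-based text corpus and sorting by frequency yields a ranked ligature list of the sort shown in the next slide.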
  38. 38. Frequency Analysis(Contd)
      S.No. | Lig | Count
      1 | ا | 2904
      2 | ر | 1600
      3 | و | 1240
      4 | کے | 745
      5 | د | 718
      6 | ں | 480
      7 | کی | 469
      8 | نے | 456
      9 | میں | 445
      10 | ن | 439
      11 | کا | 408
      12 | ہے | 377
      13 | کر | 338
      14 | کو | 309
      15 | ہ | 295
      16 | سے | 290
      17 | ی | 269
      18 | ہو | 269
      19 | س | 260
      20 | کہ | 256
      Table : List of 20 most frequent ligatures
  39. 39. 1. Segmentation
  40. 40. Feature Vectors
      S.No. | Name    | Moment 1 | Moment 2 | Moment 3 | Moment 4 | Moment 5 | Moment 6  | Moment 7
      1     | 1.bmp   | 0.52283  | 0.24376  | 0.00496  | 0.004624 | 2.21E-05 | 0.002274  | -5.63E-07
      2     | 10.bmp  | 0.16563  | 9.28E-05 | 0.000277 | 5.48E-06 | 2.01E-10 | -1.78E-08 | -7.16E-11
      3     | 100.bmp | 0.16949  | 0.000171 | 9.12E-05 | 3.05E-06 | 4.94E-11 | -2.88E-08 | 1.26E-11
      4     | 101.bmp | 0.64308  | 0.37256  | 0.008243 | 0.005689 | 3.88E-05 | 0.003196  | 3.37E-06
      5     | 102.bmp | 0.16488  | 0.0007   | 0.000168 | 6.87E-06 | 1.89E-10 | 1.61E-07  | -1.37E-10
      6     | 103.bmp | 0.40757  | 0.03951  | 0.039031 | 0.031366 | 0.001002 | 0.006081  | -0.00045
      7     | 104.bmp | 0.29624  | 0.048083 | 0.000436 | 8.78E-05 | 1.66E-08 | 1.91E-05  | 4.48E-09
      8     | 105.bmp | 0.16481  | 0.000165 | 4.32E-05 | 1.16E-06 | 6.75E-12 | -7.19E-09 | 4.63E-12
      9     | 106.bmp | 0.26849  | 0.033972 | 0        | 0        | 0        | 0         | 0
      Figure : Moment based features for some ligatures

      S.No. | Name    | Solidity | Minor Axis Length | Major Axis Length | Eccentricity | Orientation | Axis Ratio
      1     | 1.bmp   | 0.82051  | 4.0294            | 22.8431           | 0.98432      | 86.8416     | 0.1764
      2     | 10.bmp  | 0.80645  | 5.7038            | 6.0321            | 0.32538      | 56.7493     | 0.94558
      3     | 100.bmp | 0.75     | 5.6006            | 6.0319            | 0.37135      | -16.0531    | 0.92849
      4     | 101.bmp | 0.61702  | 2.9867            | 17.0919           | 0.98461      | 41.8065     | 0.17474
      5     | 102.bmp | 0.83871  | 5.4889            | 6.4133            | 0.51721      | 17.1527     | 0.85586
      6     | 103.bmp | 0.44898  | 16.0802           | 27.3559           | 0.809        | 109.504     | 0.58781
      7     | 104.bmp | 0.66667  | 5.3315            | 13.5202           | 0.91897      | 1.5099      | 0.39433
      8     | 105.bmp | 0.81081  | 6.1484            | 6.6314            | 0.37467      | 16.9823     | 0.92716
      9     | 106.bmp | 0.7      | 6.2487            | 14.2896           | 0.89932      | -3.3781     | 0.43729
      Figure : Geometric features for some ligatures
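The axis ratio and eccentricity columns of the geometric feature table appear to follow the standard ellipse relations: axis ratio = minor axis / major axis, and eccentricity = sqrt(1 - (minor / major)^2). The slides do not state these formulas, so the check below is an inference from the tabulated numbers rather than a description of how the features were actually computed.

```python
import math

def axis_ratio(minor_axis, major_axis):
    """Ratio of minor to major axis length."""
    return minor_axis / major_axis

def eccentricity(minor_axis, major_axis):
    """Eccentricity of an ellipse with the given axis lengths."""
    return math.sqrt(1.0 - (minor_axis / major_axis) ** 2)

# Axis lengths copied from the table: (minor, major, listed eccentricity, listed axis ratio).
rows = {
    "1.bmp": (4.0294, 22.8431, 0.98432, 0.1764),
    "10.bmp": (5.7038, 6.0321, 0.32538, 0.94558),
    "106.bmp": (6.2487, 14.2896, 0.89932, 0.43729),
}
for name, (minor, major, listed_ecc, listed_ratio) in rows.items():
    print(name, axis_ratio(minor, major), listed_ratio,
          eccentricity(minor, major), listed_ecc)
```

Since both quantities are functions of the same two axis lengths, they carry overlapping information; an elongated ligature such as 1.bmp has a small axis ratio and an eccentricity close to 1.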
  41. 41. Special Ligature Identification Figure : Importance of special ligatures in identifying ligatures
      Network: BPN
      Configuration: 52-26-8
      Goal: 0.01
      Mc: 0.4
      Lr: 0.1
      Figure : Network configuration used to identify special ligatures
  42. 42. Special Ligature Identification(Contd) Figure : Training to identify special ligatures
  43. 43. Intermediate Classification Figure : Analysis for identification of clusters
  44. 44. Intermediate Classification(Contd)
      Features Used: Moment 1, Solidity, Eccentricity, Axis Ratio
      No. of Clusters: 4
      No. of Images: 216
      Neural Net Used: BPN
      Configuration: 64-32-4
      Percentage Distribution of Clusters: Cluster 1: 16.67, Cluster 2: 29.63, Cluster 3: 27.78, Cluster 4: 25.93
      Figure : Network Configuration
  45. 45. Intermediate Classification(Contd) Figure : Training to identify clusters
  46. 46. Feature Extraction II
  47. 47. Ligature Identification Cluster 4: Configuration 80-40-8, Lr 0.1, Mc 0.3
  48. 48. Ligature Identification(Contd) Cluster 2: Configuration 80-40-8, Lr 0.1, Mc 0.3, Goal 0.019
  49. 49. Conclusion ♦ Two different approaches for the recognition of cursive Urdu text written in the Nastaliq script have been presented. ♦ A set of the 1000 most frequent ligatures has been identified. ♦ Our approach minimizes segmentation errors by using a segmentation-free approach. ♦ By using different types of features, we have increased the number of ligatures that can be identified. ♦ Classification performance has been improved by implementing a multi-stage classification approach; this approach is especially useful for a large number of ligatures [9,10,11].
  50. 50. Future Directions ♦ A number of possible directions are under consideration for enhancing the system for practical use, namely: • Study of the effectiveness of the features used, and a search for new features that can be effective for Urdu OCR. • Increasing the number of ligatures used for training. • Addition of special characters, numerals and Aerab for recognition as special ligatures. • Recognition of intonation marks in the document. • Addition of multilingual support to the system.
  51. 51. References 1. http://www.almaden.ibm.com/cs/dare.html 2. Sargur N. Srihari, Stephen W. Lam, “Character Recognition”. 3. H. Bunke and P. S. P. Wang, “Handbook of Character Recognition and Document Image Analysis”, World Scientific. 4. M. Shridhar, F. Kimura, “Segmentation Based Cursive Handwriting Recognition”, Handbook of Character Recognition. 5. Øivind Due Trier, Anil K. Jain and Torfinn Taxt, “Feature Extraction Methods for Character Recognition - A Survey”, Pattern Recognition, vol. 29, no. 4, pp. 641-662, 1996.
  52. 52. References(Contd) 6. Adnan Amin, “Arabic Character Recognition”, Handbook of Character Recognition. 7. Mohammad S. Khorsheed, “Structural Features of Cursive Arabic Script”. 8. Muhammad Afzal, Sarmad Hussain, “Urdu Computing Standards: Development of Urdu Zabta Takhti (UZT) 1.01”, WG2 N2413-2, SC2 N3589-2. 9. L. Xu, A. Krzyzak and C. Y. Suen, “Methods of Combining Multiple Classifiers and their Applications to Handwriting Recognition”, IEEE Trans. Systems, Man and Cybernetics, vol. 22, no. 3, pp. 418-435, 1992. 10. T. K. Ho, J. J. Hull and S. N. Srihari, “Decision Combination in Multiple Classifier Systems”, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 16, no. 1, pp. 66-75, 1994.
  53. 53. References(Contd) 11. J. Kittler, M. Hatef, R. P. W. Duin and J. Matas, “On Combining Classifiers”, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226-239, 1998. 12. Syed Afaq Husain, S. Hassan Amin, “Multi-Tier Holistic Approach to Urdu Nastaliq Recognition”, IEEE INMIC, Dec. 2002, Karachi.
  54. 54. Questions ?
  55. 55. Thank You
