An OCR System for recognition of Urdu text in Nastaliq Font
1. An OCR System for recognition of Urdu text in Nastaliq Font
By
S. Hassan Amin
Supervised By
Dr. S. Afaq Hussain
Faculty of Computer Science & Engineering
Ghulam Ishaq Khan Institute of Engineering
Sciences & Technology, Topi-Swabi, 2004
3. Introduction
♦ Urdu is the national language of Pakistan, and is
understood by well over 300 million people
around the world.
♦ There is a need to convert the historical corpus of
Urdu literature into electronic form, so that Urdu
can prosper in the age of computers.
♦ Urdu text recognition endeavors to convert
scanned Urdu documents automatically into
computerized text files.
4. Research Scope
♦ Paper documents have been the most important
means of exchanging information for ages, but this
is changing as we rapidly move towards a
paperless society.
♦ IBM has estimated that about $250 billion is spent
worldwide every year (largely on operator salaries,
etc.) on keying in information from paper
documents, and that this is the cost of manually
capturing information from only 5% of the
available documents [1].
♦ Urdu Text Recognition
♦ Urdu Text Transliteration
♦ Machine Translation
5. Objectives
♦ The main objective of this research is to build an
OCR system for the Urdu language that is effective for
Nastaliq script irrespective of font size and orientation. To
achieve this objective, there are a number of sub-goals:
To investigate the problem of Urdu OCR in depth, and to
propose new and better ways to solve this problem.
To investigate the use of appropriate set of features for
Urdu OCR.
To establish a database of Urdu ligatures for investigating
the problem of Urdu OCR.
To investigate classification methods that can be useful for
the problem of Urdu OCR.
6. Optical Character Recognition (OCR)
♦ Character Recognition or Optical Character
Recognition (OCR) is the process of converting
scanned images of machine printed or handwritten
text (numerals, letters and symbols), into a
computer processable format (such as ASCII and
Unicode) [2].
♦ Offline character recognition is performed after
the writing or printing has been completed.
♦ In online character recognition, the computer
recognizes characters as they are drawn, using
timing information.
7. Steps in OCR
1. Image Acquisition
2. Preprocessing
3. Segmentation
4. Feature Extraction
5. Classification
6. Post Processing
8. 1. Image Acquisition
♦ The document is converted into a digital
image by a digitizer, which can be a
scanner (offline recognition), a camera, or a
tablet digitizer (online recognition).
9. 2. Preprocessing
♦ Preprocessing involves noise reduction,
skew detection, slant normalization,
document decomposition, etc.
♦ For slant estimation, there are methods such
as the projection method and the chain code
method [4].
♦ For estimating the skew angle of a page,
there are methods such as the orientation-
dependent histogram [3].
10. 3. Segmentation
♦ Segmentation is the process of dividing an
image into regions, each likely to
contain a single object or a group of
objects of the same type. For instance, an
object can be a character on a text page or a
line segment in an engineering drawing.
♦ In OCR, the commonly used segmentation
algorithms are XY tree decomposition,
run-length smearing and the Hough transform.
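As an illustration of run-length smearing, here is a minimal sketch in Python. It assumes the image is a list of rows of 0/1 pixels (1 = black) and that only horizontal smearing is needed; the function names and the threshold value are hypothetical and do not come from the slides.

```python
def smear_row(row, threshold):
    """Horizontal run-length smearing on one row of 0/1 pixels (1 = black).

    A run of white pixels lying between two black pixels is filled with
    black when its length is at most `threshold`.
    """
    out = list(row)
    last_black = None  # index of the most recent black pixel seen so far
    for i, pixel in enumerate(row):
        if pixel == 1:
            gap = i - last_black - 1 if last_black is not None else 0
            if 0 < gap <= threshold:
                for j in range(last_black + 1, i):
                    out[j] = 1
            last_black = i
    return out


def smear_image(image, threshold):
    """Apply horizontal smearing to every row of a binary image."""
    return [smear_row(row, threshold) for row in image]
```

With a threshold of 2, short gaps inside a word are merged into one block while wider gaps between blocks are preserved.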
11. 4. Feature Extraction
♦ Selection of appropriate feature extraction
method is probably the single most
important factor in achieving high
recognition performance [5].
♦ A newcomer to the field is faced with the
challenge of selecting appropriate features
for his/her application.
12. Feature Extraction (Contd)
♦ Some useful feature extraction methods in the
field of OCR are:
1. Geometric Features
2. Structural Features
3. Moment based Features
4. Template Matching
5. Unitary Image Transforms
6. Zoning
7. Contour Profiles
8. Fourier Descriptors
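To make one of the listed methods concrete, the following is a minimal sketch of zoning in Python. It assumes a binary image stored as a list of rows of 0/1 pixels and uses the fraction of black pixels per zone as the feature; the function name and the grid size are illustrative choices, not taken from the slides.

```python
def zoning_features(image, zone_rows, zone_cols):
    """Split a binary image into a grid of zones and return, for each zone
    (row-major order), the fraction of its pixels that are black (1)."""
    rows, cols = len(image), len(image[0])
    features = []
    for zr in range(zone_rows):
        r0 = zr * rows // zone_rows
        r1 = (zr + 1) * rows // zone_rows
        for zc in range(zone_cols):
            c0 = zc * cols // zone_cols
            c1 = (zc + 1) * cols // zone_cols
            area = (r1 - r0) * (c1 - c0)
            black = sum(image[r][c] for r in range(r0, r1) for c in range(c0, c1))
            features.append(black / area if area else 0.0)
    return features


glyph = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]
densities = zoning_features(glyph, 2, 2)
```

The resulting vector has one entry per zone, so a 2 x 2 grid gives four features regardless of the image size.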
13. 5. Classification
♦ Classification is the process of identifying
each character and assigning to it the
correct character class. Two major
approaches for classification methods are:
1. Decision theoretic method
2. Structural Methods
14. 1. Decision theoretic method
♦ These methods are used when the
description of the character can be
represented numerically in a feature vector.
♦ The principal approaches to decision-theoretic
recognition are minimum distance
classifiers, statistical classifiers and neural
networks.
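The simplest of these, the minimum distance classifier, can be sketched as follows. This is an illustrative Python version assuming Euclidean distance to per-class mean vectors; the function names and the toy data are hypothetical.

```python
import math


def class_means(samples):
    """samples maps a class label to a list of feature vectors.
    Returns a dict mapping each label to the mean feature vector."""
    means = {}
    for label, vectors in samples.items():
        n = len(vectors)
        means[label] = [sum(column) / n for column in zip(*vectors)]
    return means


def classify(vector, means):
    """Return the label whose mean vector is closest to `vector`."""
    return min(means, key=lambda label: math.dist(vector, means[label]))


training = {
    "a": [[0.0, 0.0], [2.0, 0.0]],
    "b": [[10.0, 10.0], [12.0, 10.0]],
}
means = class_means(training)
```

Here `classify([2.0, 1.0], means)` selects class "a" because its mean (1, 0) is nearer than the mean of class "b" (11, 10).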
15. 2. Structural Methods
♦ Within the area of the structural
recognition, syntactic methods are among
the most common approaches.
♦ In Syntactic pattern recognition, measures
of similarity based on the relationship
between structural components are
formulated using grammatical concepts.
16. 6. Post Processing
♦ In Post Processing, we have
1. Grouping
2. Error Detection and Correction
17. 1. Grouping
♦ The result of plain symbol recognition is a set of
individual symbols.
♦ These symbols in themselves usually do not
contain enough information.
♦ We would like to associate the individual symbols
that belong to the same string with each other,
making up words and numbers.
♦ The process of performing this association of
symbols into strings is commonly referred to as
grouping.
18. 2. Error Detection and Correction
♦ Along with the grouping of characters,
another issue to take care of is the context in
which each character appears.
♦ Even the best OCR systems cannot identify
each character with 100% accuracy. These
errors may be detected or even corrected by
use of context.
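One common way to use context is to compare each recognized word against a lexicon. The sketch below assumes that approach and uses Python's standard `difflib` module; the lexicon, the cutoff value and the function name are illustrative and are not described in the slides.

```python
import difflib


def correct_word(word, lexicon, cutoff=0.6):
    """Return the closest lexicon entry to `word`, or `word` itself when
    nothing in the lexicon is similar enough."""
    matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word


lexicon = ["recognition", "ligature", "segmentation"]
fixed = correct_word("recogmtion", lexicon)   # OCR confused "ni" with "m"
unchanged = correct_word("xyz", lexicon)      # no plausible match
```

A word that is already in the lexicon is returned as is, and a word with no close match is left untouched so that it can be flagged for manual review.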
19. Urdu Writing Characteristics
♦ Urdu is a cursive language, which has
evolved from the Arabic, Persian and Turkish
languages.
♦ Depending on the source, the Urdu alphabet
is reported to have 36, 37, 42, 51 or 53
characters [8].
♦ The UZT 1.01 standard has 42 characters.
21. Urdu Writing Characteristics (Contd)
Characteristics            Urdu    Arabic  Latin  Hebrew  Hindi
H-Justification            RL      RL      LR     RL      LR
V-Justification            Center  Base    No     No      Top
Cursive                    Yes     Yes     No     No      Yes
Diacritics                 Yes     Yes     No     No      Yes
# Vowels                   2       2       5      11      -
# Letters                  37      28      26     22      40
Letter Shapes              1-28    1-4     2      1       1
Complementary Characters   5       3       -      -       -
22. Cursive Script Recognition Schemes
♦ There are two strategies that have been
applied to cursive script recognition. As
mentioned by Amin and Khorsheed [6,7],
they can be categorized as follows:
1. Holistic Strategies in which the
recognition is globally performed on the
whole representation of words and where
there is no attempt to identify characters
individually.
23. Cursive Script Recognition Schemes (Contd)
2. Analytical strategies, in which words are
not considered as a whole but as
sequences of small units, and
recognition is not performed directly at
word level but at an intermediate level
dealing with these units, which can be
graphemes, segments, pseudo-letters, etc.
24. Research Methodology
♦ Two approaches to recognizing Urdu ligatures
printed in Nastaliq script are presented. Both
approaches are holistic in nature. They are
tested on the identification of a set of the
most frequent ligatures printed in Noori Nastaliq
script. The suggested approaches to recognizing
Urdu text are:
1. Multi-Tier Holistic Approach
2. Multi-Stage Classification Approach
25. Multi-Tier Holistic Approach to Urdu Nastaliq Recognition
♦ A multi-tier holistic approach using a feed-
forward backpropagation neural network
was implemented [12].
27. 1. Segmentation
♦ Connected Component Labeling is applied to the
image of Urdu text.
♦ This technique assigns to each connected
component of binary image a distinct label.
♦ The labels are usually natural numbers from 1 to
the number of connected components in the input
image.
♦ The algorithm scans the image from left-to-right
and top-to-bottom.
28. Segmentation (Contd)
♦ On the first line containing black pixels, a unique
label is assigned to each contiguous run of black
pixels.
♦ For each black pixel, the pixels in its eight-
neighborhood are examined. If any of these
pixels has already been labeled, the same label is
assigned to the current pixel; otherwise a new
label is assigned to it. The procedure continues to
the bottom of the image.
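The labeling procedure above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this work: it assumes a binary image stored as a list of rows of 0/1 pixels, and it adds an equivalence-merging step (a small union-find table plus a second pass), which the slides do not describe but which is needed when two provisional labels turn out to belong to the same component.

```python
def label_components(image):
    """Label 8-connected components of black pixels (1) in a binary image.

    Returns (labels, count): a grid with 0 for background and 1..count
    for the components, and the number of components found.
    """
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    parent = [0]  # union-find table; index 0 is reserved for background

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # First pass: scan left-to-right, top-to-bottom.
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1:
                continue
            # Neighbours already visited in this scan order: W, NW, N, NE.
            seen = []
            for dr, dc in ((0, -1), (-1, -1), (-1, 0), (-1, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and labels[nr][nc]:
                    seen.append(labels[nr][nc])
            if not seen:
                parent.append(len(parent))      # new provisional label
                labels[r][c] = len(parent) - 1
            else:
                roots = [find(label) for label in seen]
                smallest = min(roots)
                labels[r][c] = smallest
                for root in roots:
                    parent[root] = smallest     # record the equivalence

    # Second pass: map provisional labels to consecutive final labels.
    final = {}
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                root = find(labels[r][c])
                final.setdefault(root, len(final) + 1)
                labels[r][c] = final[root]
    return labels, len(final)
```

For a U-shaped stroke, the two vertical arms first receive different provisional labels and are merged into one component when the scan reaches the bottom row.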
29. Feature Extraction I
♦ In this stage, we extract some features that will help us in the
recognition of special ligatures (see figure). These features are
solidity, number of holes, axis ratio, eccentricity, moments,
normalized segment length, curvature, and the ratio of bounding box
width to height.
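A few of these features can be computed directly from the pixel coordinates of a component. The sketch below is a hypothetical Python illustration: the slides do not give the exact definitions used, so it assumes eccentricity and axis ratio are derived from the eigenvalues of the second-order central moment matrix, and it omits solidity, holes, curvature and segment length.

```python
import math


def shape_features(pixels):
    """pixels: list of (row, col) coordinates of one component's black pixels."""
    n = len(pixels)
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1
    mean_r, mean_c = sum(rows) / n, sum(cols) / n

    # Second-order central moments, normalised by the pixel count.
    mu_rr = sum((r - mean_r) ** 2 for r in rows) / n
    mu_cc = sum((c - mean_c) ** 2 for c in cols) / n
    mu_rc = sum((r - mean_r) * (c - mean_c) for r, c in pixels) / n

    # Eigenvalues of the 2x2 moment matrix give the spread along the
    # major and minor axes of the component.
    half_trace = (mu_rr + mu_cc) / 2
    spread = math.sqrt(((mu_rr - mu_cc) / 2) ** 2 + mu_rc ** 2)
    major = half_trace + spread
    minor = max(half_trace - spread, 0.0)  # guard against rounding below zero

    return {
        "bbox_ratio": width / height,
        "centroid": (mean_r, mean_c),
        "axis_ratio": math.sqrt(minor / major) if major > 0 else 1.0,
        "eccentricity": math.sqrt(1 - minor / major) if major > 0 else 0.0,
    }


line = shape_features([(0, c) for c in range(5)])          # horizontal stroke
square = shape_features([(0, 0), (0, 1), (1, 0), (1, 1)])  # compact blob
```

A thin stroke yields an eccentricity close to 1, while a compact blob yields a value close to 0, which is why such features help separate dots and diacritics from elongated strokes.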
30. Special Ligature Identification
♦ A feed-forward BPN network is trained on
the feature vectors obtained in the Feature
Extraction I stage. During testing, this
network is used to identify each input ligature as
one of the special ligatures. If no valid output is
returned, the ligature is identified as a
base ligature.
31. Feature Extraction II
♦ In this stage, special ligatures are associated with
the base ligatures. Each special ligature is associated
with the base ligature whose centroid-to-centroid
distance from it is minimum.
♦ A number of lines are grown from the center of
each special ligature; when one of these lines
touches a base ligature, the special ligature is
associated with that base ligature.
♦ In this stage, due to the association of special ligatures
with the base ligatures, twenty new features are
added to the feature vector of the base ligature.
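The centroid-based association can be sketched as below. This is an illustrative Python version covering only the minimum centroid-to-centroid distance rule; the line-growing rule is not shown, and the identifiers and data layout (dicts of pixel coordinate lists) are assumptions.

```python
import math


def centroid(pixels):
    """Mean (row, col) position of a list of pixel coordinates."""
    n = len(pixels)
    return (sum(r for r, _ in pixels) / n, sum(c for _, c in pixels) / n)


def associate(special, bases):
    """Map each special-ligature id to the id of the base ligature whose
    centroid is nearest to the special ligature's centroid."""
    base_centroids = {base_id: centroid(p) for base_id, p in bases.items()}
    result = {}
    for special_id, pixels in special.items():
        sc = centroid(pixels)
        result[special_id] = min(
            base_centroids,
            key=lambda base_id: math.dist(sc, base_centroids[base_id]),
        )
    return result


bases = {
    "base_left": [(10, 0), (10, 1), (10, 2)],
    "base_right": [(10, 20), (10, 21), (10, 22)],
}
special = {"dot": [(5, 19), (5, 20)]}
links = associate(special, bases)
```

In this toy layout the dot sits above the right-hand stroke, so it is attached to `base_right`.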
32. Classification and Recognition
♦ In this stage, the final feature vector
consisting of 34 features is fed into a feed-
forward backpropagation neural network.
The network architecture consists of 34
inputs, 65 hidden neurons and 45 output
neurons.
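The shape of the 34-65-45 network can be illustrated with a forward pass in plain Python. This is only a sketch: the weights are random rather than trained, and the sigmoid activation, the weight range and all names are assumptions, since the slides give only the layer sizes.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Create n_out neurons, each with n_in weights plus one trailing bias."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]


def forward_layer(layer, inputs):
    """Sigmoid activation of every neuron in one layer."""
    outputs = []
    for neuron in layer:
        # zip stops after the n_in weights, so the bias (last entry) is
        # added exactly once.
        total = neuron[-1] + sum(w * x for w, x in zip(neuron, inputs))
        outputs.append(1.0 / (1.0 + math.exp(-total)))
    return outputs


def forward(network, features):
    activations = features
    for layer in network:
        activations = forward_layer(layer, activations)
    return activations


rng = random.Random(0)  # fixed seed so the sketch is repeatable
network = [make_layer(34, 65, rng), make_layer(65, 45, rng)]
scores = forward(network, [0.5] * 34)
predicted_class = max(range(len(scores)), key=scores.__getitem__)
```

Each of the 45 outputs corresponds to one ligature class, and the class with the highest score is taken as the prediction.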
33. Multi-Stage Classification Approach to Urdu Text Recognition
♦ The motivation behind this approach is the
belief that classification performance
could be improved by combining multiple
classifiers [9,10,11].
34.
35. (Contd)
♦ As shown in the figure, the first three stages are
similar to the multi-tier approach.
♦ Intermediate Classification
In the training phase, we train a competitive network
on the feature vectors of base ligatures to divide the input
data into the desired number of clusters.
Also in the training phase, an LVQ/BPN network is trained
on the output of the competitive network to classify
an input pattern into a particular class or cluster.
In the testing phase, the input feature vector is
presented to the trained LVQ/BPN network, which gives
us the desired class/cluster.
36. (Contd)
♦ Ligature Identification
A BPN network is trained for all the ligatures
belonging to a particular class/cluster in the
classification and recognition stage of the
system.
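The clustering stage can be illustrated with a simple winner-take-all competitive learning rule. The Python sketch below is a stand-in under stated assumptions: the slides do not give the learning rule, the learning rate, the number of epochs or the initialisation, and the second-stage per-cluster BPN networks are not shown.

```python
import math


def nearest(vector, prototypes):
    """Index of the prototype closest to `vector` (the winning unit)."""
    return min(range(len(prototypes)), key=lambda i: math.dist(vector, prototypes[i]))


def train_competitive(vectors, initial_prototypes, lr=0.1, epochs=20):
    """Winner-take-all learning: only the winning prototype moves a
    fraction `lr` of the way towards each presented vector."""
    prototypes = [list(p) for p in initial_prototypes]
    for _ in range(epochs):
        for v in vectors:
            w = nearest(v, prototypes)
            prototypes[w] = [p + lr * (x - p) for p, x in zip(prototypes[w], v)]
    return prototypes


data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
prototypes = train_competitive(data, initial_prototypes=[[0, 0], [10, 10]])
cluster_of_new_sample = nearest([0.5, 0.5], prototypes)
```

Once a sample has been routed to a cluster, a classifier trained only on the ligatures of that cluster would make the final decision, which keeps each second-stage network small.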
38. Frequency Analysis
♦ To establish a database of Urdu images for training and testing, it was
decided that the most frequent Urdu ligatures would be identified from the
World Wide Web.
♦ This was a challenge, since most Urdu sites are based on images of Urdu
text, so there was no way of counting Urdu ligatures without first
identifying them.
♦ The BBC Urdu news site http://www.bbc.co.uk/urdu/ was selected for
frequency analysis because it is a font-based Urdu site.
♦ The hex codes of the BBC Urdu font were studied.
♦ A study of the Urdu font was also done. There are three types of Urdu
characters:
1. Characters which do not connect on either side, e.g. alif
2. Characters which connect on both sides, e.g. bay, tay
3. Characters which do not connect from the left, e.g. wow, ray
♦ There are two types of breaks in an Urdu text file: a hard break, identified
by 0x0020, and a soft break, identified by the nature of the character. On the
basis of these breaks and punctuation marks, we decide where one ligature
ends and the next begins, and hence keep a count of ligatures.
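The counting rule based on hard and soft breaks can be sketched as follows. This Python illustration makes several assumptions not stated in the slides: the text is Unicode rather than the BBC font's own hex codes, the sets of non-joining letters and punctuation marks are my own partial lists, and diacritics are not treated specially.

```python
from collections import Counter

# Assumed set of letters that do not join to the following letter
# (they cause a "soft break"): alif, alif madda, dal, ddal, zal, ray,
# rray, zay, zhay, wow, bari yay.
NON_JOINERS = set(
    "\u0627\u0622\u062F\u0688\u0630\u0631\u0691\u0632\u0698\u0648\u06D2"
)
# Assumed hard breaks: space (0x0020), newline, tab, and common Urdu and
# Latin punctuation marks.
HARD_BREAKS = set(" \n\t\u060C\u06D4\u061F.,!?")


def split_ligatures(text):
    """Split text into ligatures at hard breaks and after non-joining letters."""
    ligatures, current = [], ""
    for ch in text:
        if ch in HARD_BREAKS:
            if current:
                ligatures.append(current)
            current = ""
        else:
            current += ch
            if ch in NON_JOINERS:  # soft break after this letter
                ligatures.append(current)
                current = ""
    if current:
        ligatures.append(current)
    return ligatures


def ligature_counts(text):
    """Frequency of each ligature in the text."""
    return Counter(split_ligatures(text))


# "kitab kitab": kaf + tay + alif form one ligature, bay forms another.
word = "\u06A9\u062A\u0627\u0628"
counts = ligature_counts(word + " " + word)
```

Calling `counts.most_common(n)` then yields the n most frequent ligatures, which is the kind of ranking needed to select a training set.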
42. Special Ligature Identification
Figure: Importance of special ligatures in identifying ligatures
Network: BPN    Configuration: 52-26-8
Goal: 0.01    Mc: 0.4    Lr: 0.1
Figure: Network configuration used to identify special ligatures
45. Intermediate Classification (Contd)
Features Used: Moment 1, Solidity, Eccentricity, Axis Ratio
No. of Clusters: 4
No. of Images: 216
Neural Net Used: BPN    Configuration: 64-32-4
Percentage Distribution of Clusters
Cluster 1  Cluster 2  Cluster 3  Cluster 4
16.67      29.63      27.78      25.93
Figure: Network Configuration
50. Conclusion
♦ Two different approaches for recognition of cursive Urdu
text written in Nastaliq script have been presented.
♦ A set of the 1000 most frequent ligatures has been identified.
♦ Our approach minimizes errors due to segmentation by
using a segmentation-free approach.
♦ By using different types of features, we have increased the
number of ligatures that can be identified.
♦ Classification performance has been improved by
implementing the multi-stage classification approach; this
approach is especially useful for a large number of
ligatures [9,10,11].
51. Future Directions
♦ A number of possible directions are under consideration
for enhancing the system for practical use, namely:
Study of the effectiveness of the features used, and a search for new
features that can be effective for Urdu OCR.
Increasing the number of ligatures used for training.
Addition of special characters, numerals and Aerab for
recognition as special ligatures.
Recognition of intonation marks in the document.
Addition of multilingual support to the system.
52. References
1. http://www.almaden.ibm.com/cs/dare.html
2. Sargur N. Srihari, Stephen W. Lam, “Character Recognition”.
3. H. Bunke and P. S. P. Wang, “Handbook of Character Recognition and
Document Image Analysis”, World Scientific.
4. M. Shridhar, F. Kimura, “Segmentation-Based Cursive Handwriting
Recognition”, Handbook of Character Recognition.
5. Øivind Due Trier, Anil K. Jain and Torfinn Taxt, “Feature Extraction
Methods for Character Recognition - A Survey”, Pattern
Recognition, vol. 29, no. 4, pp. 641-662, 1996.
53. References (Contd)
6. Adnan Amin, “Arabic Character Recognition”, Handbook of
Character Recognition.
7. Mohammad S. Khorsheed, “Structural Features of Cursive Arabic
Script”.
8. Muhammad Afzal, Sarmad Hussain, “Urdu Computing
Standards: Development of Urdu Zabta Takhti-WG2 N2413-2-SC2
N3589-2 (UZT) 1.01”.
9. L. Xu, A. Krzyzak, and C. Y. Suen, “Methods of Combining
Multiple Classifiers and their Applications to Handwriting
Recognition,” IEEE Trans. Systems, Man and Cybernetics, vol. 22,
no. 3, pp. 418-435, 1992.
10. T. K. Ho, J. J. Hull and S. N. Srihari, “Decision Combination in
Multiple Classifier Systems,” IEEE Trans. Pattern Analysis and
Machine Intelligence, vol. 16, no. 1, pp. 66-75, 1994.
54. References (Contd)
11. J. Kittler, M. Hatef, R. P. W. Duin and J.
Matas, “On Combining Classifiers,” IEEE
Trans. Pattern Analysis and Machine
Intelligence, vol. 20, no. 3, pp. 226-239, 1998.
12. Syed Afaq Husain, S. Hassan Amin, “Multi-Tier
Holistic Approach to Urdu Nastaliq
Recognition,” IEEE INMIC, Dec. 2002, Karachi.