This paper presents a new multi-tier holistic approach for recognizing Urdu text written in the Nastaliq script. It first separates special ligatures such as dots, tay, hamza, and mad from base ligatures, and then associates each special ligature with its neighboring base ligature. Features extracted from the ligatures and from the special-ligature/base-ligature associations are input to a neural network that recognizes the ligatures in three steps: 1) identifying special ligatures, 2) associating them with base ligatures, and 3) recognizing the base ligatures. The system was tested on 200 ligatures, achieving 100% accuracy on ligatures in its training set and closest-match classification for unseen ligatures.
An exhaustive font and size invariant classification scheme for ocr of devana... (ijnlc)
The main challenge in any Optical Character Recognition (OCR) system is dealing with multiple fonts and sizes. OCR of Indian languages must also handle a huge number of conjunct characters whose shapes change drastically across fonts, and separating a conjunct character into its constituent symbols leads to segmentation errors. The proposed approach handles both of these problems in the context of the Devanagari script. An attempt is made to identify all possible connected symbols of Devanagari in the middle zone (a consonant, vowel, half consonant, or conjunct consonant, henceforth referred to as a basic symbol) without segmenting the conjunct characters. From a study of 469,580 words drawn from a variety of sources, it was found that only 345 symbols occur frequently in the middle zone, covering 99.97% of the text. These are then classified into 16 classes on the basis of structural properties that are invariant across fonts and sizes. To validate the proposed classification scheme, results are presented on 25 fonts and three sizes.
The Heuristic Extraction Algorithms for Freeman Chain Code of Handwritten Cha... (Waqas Tariq)
Handwriting character recognition (HCR) is the ability of a computer to receive and interpret handwritten input. Among the many representation schemes used in HCR is the Freeman chain code (FCC), a sequence of direction codes tracing a character from a starting point, which is widely used in image processing. The main problem with representing a character using FCC is that the code depends on the starting point. Unfortunately, FCC extraction using one continuous route, and minimizing the length of the chain code extracted from a thinned binary image (TBI), have not been widely explored. To address this, heuristic algorithms are proposed for extracting an FCC that correctly represents the character. This paper proposes two such algorithms, one randomized and one enumeration-based. The randomized algorithm makes random choices, while the enumeration-based algorithm enumerates all candidate solutions. The algorithms are measured by route length and computation time. The experiments use chain-code representations derived from established previous work on the Center of Excellence for Document Analysis and Recognition (CEDAR) dataset, which consists of 126 upper-case letter characters. The results show that the route lengths of the two algorithms are similar, but the enumeration-based algorithm takes more computation time than the randomized one because it considers all branches in the route walk.
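The basic chain-code walk over a thinned stroke can be sketched as follows. This is a minimal illustrative reconstruction, not either of the paper's two heuristics: it greedily steps to the first unvisited neighbouring stroke pixel, so the resulting code (and its length) depends on the starting point, which is exactly the problem the paper studies.

```python
# A minimal sketch (not the paper's algorithm) of extracting a Freeman chain
# code from a thinned binary image by greedily walking unvisited stroke pixels.
# Direction codes: 0=E, 1=NE, 2=N, 3=NW, 4=W, 5=SW, 6=S, 7=SE.

DIRS = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]

def freeman_chain_code(img, start):
    """Walk the stroke from `start`, recording one direction code per step."""
    visited = {start}
    code = []
    r, c = start
    while True:
        for d, (dr, dc) in enumerate(DIRS):
            nr, nc = r + dr, c + dc
            if (0 <= nr < len(img) and 0 <= nc < len(img[0])
                    and img[nr][nc] == 1 and (nr, nc) not in visited):
                code.append(d)
                visited.add((nr, nc))
                r, c = nr, nc
                break
        else:
            return code  # no unvisited neighbour: the route ends

# A tiny L-shaped stroke: down the left edge, then right along the bottom.
img = [[1, 0, 0],
       [1, 0, 0],
       [1, 1, 1]]
print(freeman_chain_code(img, (0, 0)))  # [6, 6, 0, 0]
```

Starting from a different end of the stroke yields a different code sequence, which is why the paper searches over routes rather than fixing one.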
SEGMENTATION OF CHARACTERS WITHOUT MODIFIERS FROM A PRINTED BANGLA TEXT (cscpconf)
Optical Character Recognition (OCR) is one of the fundamental research areas of image processing and pattern recognition. The accuracy of an OCR system depends on proper segmentation of the characters. This paper is concerned with the segmentation of printed Bangla characters without modifiers for an OCR system. The basic steps needed for developing an OCR system are also discussed.
OCR-THE 3 LAYERED APPROACH FOR DECISION MAKING STATE AND IDENTIFICATION OF TE... (ijaia)
Optical Character Recognition (OCR) is the digitization of handwritten, typewritten, or printed text into machine-encoded form, and it supports a wide range of everyday applications: OCR is used successfully in finance, legal services, banking, health care, and home appliances. India is a multicultural country with many literary and scripted traditions. Telugu, a southern Indian language, is a syllabic language: each script symbol represents a complete syllable and may be formed with conjunct consonants. Recognizing mixed conjunct consonants is harder than recognizing normal consonants because of variation in written strokes and conjunct mixing at the pre- and post-consonant level. This paper proposes a layered methodology to recognize characters, conjunct consonants, and mixed conjunct consonants, and presents an efficient classification of handwritten and printed conjunct consonants. An Advanced Fuzzy Logic controller takes text, written or printed, collected as images from scanned files or a digital camera; the image is processed by examining its intensity against a quality ratio, characters are extracted according to that quality, and character orientation, alignment, thickness, and base and print ratios are then checked. The input characters are classified in two ways: normal consonants and conjunct consonants. The digitized text is divided into three layers: the middle layer holds normal consonants, while the top and bottom layers hold the marks of mixed conjunct consonants. Recognition starts from the middle layer and then checks the top and bottom layers: a character is treated as a conjunct consonant when symbolic marks are detected in the top or bottom layer of the base character, and as a normal consonant otherwise. Post-processing is applied to all three layers, concentrating on the readability and compatibility of the recognized text; if readability fails, the process is repeated. The recognition process includes slant correction, thinning, normalization, segmentation, feature extraction, and classification, and the pre-processing, segmentation, character recognition, and post-processing modules of the algorithm are discussed. The main objectives of this paper are to develop classification and identification of different prototypes for written and printed consonants, conjunct consonants, and symbols based on a three-layered approach over different measurable areas using fuzzy logic, and to determine suitable features for handwritten character recognition.
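The three-layer decomposition described above can be sketched as follows. This is a minimal illustration under simplifying assumptions: the zone boundaries here are fixed fractional heights, whereas the paper derives them from measurable areas, and the ink test stands in for the fuzzy-logic classification.

```python
# A minimal sketch of splitting a character image into top, middle, and
# bottom zones and checking the outer zones for conjunct marks. The split
# fractions are illustrative, not values from the paper.

def split_zones(img, top_frac=0.25, bottom_frac=0.75):
    """Split rows of a binary character image into three horizontal zones."""
    h = len(img)
    t, b = int(h * top_frac), int(h * bottom_frac)
    return img[:t], img[t:b], img[b:]

def has_ink(zone):
    return any(any(row) for row in zone)

def classify(img):
    """Conjunct if any mark appears above or below the base character."""
    top, middle, bottom = split_zones(img)
    if has_ink(top) or has_ink(bottom):
        return "conjunct consonant"
    return "normal consonant"

# An 8-row character with ink only in the middle zone.
char = [[0] * 4] * 2 + [[0, 1, 1, 0]] * 4 + [[0] * 4] * 2
print(classify(char))  # normal consonant
```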
FREEMAN CODE BASED ONLINE HANDWRITTEN CHARACTER RECOGNITION FOR MALAYALAM USI... (acijjournal)
Handwritten character recognition is the conversion of handwritten text into machine-readable and editable form; online character recognition deals with live conversion of characters as they are written. Malayalam is a language spoken by millions of people in the state of Kerala and the union territories of Lakshadweep and Pondicherry in India. It is written mostly in the clockwise direction and consists of loops and curves. The method trains a simple three-layer neural network using the backpropagation algorithm. Freeman codes represent each character as a feature vector; these feature vectors are the network's inputs during the training and testing phases. The output is the character expressed in Unicode format.
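A three-layer network trained by backpropagation on chain-code features can be sketched as follows. Everything here is illustrative, not the paper's configuration: the layer sizes, learning rate, and the two toy "characters" (normalized direction histograms) are assumptions made for the sake of a runnable example.

```python
# A minimal sketch of a three-layer network trained with backpropagation on
# Freeman-code histogram features. Sizes and data are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# 8 inputs (one per chain-code direction), 6 hidden units, 2 classes.
W1 = rng.normal(0, 0.5, (8, 6))
W2 = rng.normal(0, 0.5, (6, 2))

def train_step(x, t, lr=0.5):
    """One backpropagation update; returns the squared error before it."""
    global W1, W2
    h = sigmoid(x @ W1)           # hidden activations
    y = sigmoid(h @ W2)           # output activations
    d2 = (y - t) * y * (1 - y)    # output-layer deltas
    d1 = (d2 @ W2.T) * h * (1 - h)  # hidden-layer deltas
    W2 -= lr * np.outer(h, d2)
    W1 -= lr * np.outer(x, d1)
    return float(((y - t) ** 2).sum())

# Two toy "characters": normalized direction histograms with one-hot targets.
data = [(np.array([4, 0, 0, 0, 4, 0, 0, 0]) / 8.0, np.array([1.0, 0.0])),
        (np.array([0, 0, 4, 0, 0, 0, 4, 0]) / 8.0, np.array([0.0, 1.0]))]

for epoch in range(2000):
    for x, t in data:
        train_step(x, t)
```

After training, the index of the most active output unit would be mapped to the corresponding Unicode character.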
An Optical Character Recognition for Handwritten Devanagari Script (IJERA Editor)
Optical Character Recognition is the process of recognizing characters from scanned documents, and many OCR systems are now available on the market. Most of these systems, however, work on Roman, Chinese, Japanese, or Arabic characters; there is not a sufficient body of work on Indian scripts such as Devanagari. This paper therefore presents a review of optical character recognition for handwritten Devanagari script.
OCR-THE 3 LAYERED APPROACH FOR CLASSIFICATION AND IDENTIFICATION OF TELUGU HA... (csandit)
Optical Character Recognition (OCR) is the digitization of handwritten, typewritten, or printed text into machine-encoded form, and it supports a wide range of everyday applications: OCR is used successfully in finance, legal services, banking, health care, and home appliances. India is a multicultural country with many literary and scripted traditions. Telugu, a southern Indian language, is a syllabic language: each script symbol represents a complete syllable and may be formed with conjunct consonants. Recognizing mixed conjunct consonants is harder than recognizing normal consonants because of variation in written strokes and conjunct mixing at the pre- and post-consonant level. This paper proposes a layered methodology to recognize characters, conjunct consonants, and mixed conjunct consonants, and presents an efficient classification of handwritten and printed conjunct consonants. An Advanced Fuzzy Logic controller takes text, written or printed, collected as images from scanned files or a digital camera; the image is processed by examining its intensity against a quality ratio, characters are extracted according to that quality, and character orientation, alignment, thickness, and base and print ratios are then checked. The input characters are classified in two ways: normal consonants and conjunct consonants. The digitized text is divided into three layers: the middle layer holds normal consonants, while the top and bottom layers hold the marks of mixed conjunct consonants. Recognition starts from the middle layer and then checks the top and bottom layers: a character is treated as a conjunct consonant when symbolic marks are detected in the top or bottom layer of the base character, and as a normal consonant otherwise. Post-processing is applied to all three layers, concentrating on the readability and compatibility of the recognized text; if readability fails, the process is repeated. The recognition process includes slant correction, thinning, normalization, segmentation, feature extraction, and classification, and the pre-processing, segmentation, character recognition, and post-processing modules of the algorithm are discussed. The main objectives of this paper are to develop classification and identification of different prototypes for written and printed consonants, conjunct consonants, and symbols based on a three-layered approach over different measurable areas using fuzzy logic, and to determine suitable features for handwritten character recognition.
Handwritten Character Recognition: A Comprehensive Review on Geometrical Anal... (iosrjce)
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
Segmentation of Handwritten Chinese Character Strings Based on improved Algor... (ijeei-iaes)
The Liu algorithm has attracted attention for its high accuracy in segmenting Japanese postal addresses, but its complexity and difficult implementation have hindered its popularization and application. In this paper, the author applies the principles of the Liu algorithm to handwritten Chinese character segmentation, according to the characteristics of handwritten Chinese characters and based on a deep study of the algorithm. The author also puts forward judgment criteria for classifying segmentation blocks and for the adhering (touching) modes of handwritten Chinese characters. During segmentation, a text image is viewed as a sequence of Connected Components (CCs), each made up of several horizontal runs of black pixels. The author determines whether these parts should be merged into a segment by analyzing the connected components, then segments the image according to the adhering mode based on an analysis of outline edges, and finally cuts the text image into character segments. Experimental results show that the improved Liu algorithm obtains high segmentation accuracy and produces satisfactory segmentation results.
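The connected-component view described above can be sketched as follows. This is a generic 8-connected labelling pass, not the Liu algorithm itself; the merge criteria and adhering-mode analysis of the paper would operate on the components this step produces.

```python
# A minimal sketch: group black pixels of a binary text image into
# 8-connected components, the raw units later merged into segments.
from collections import deque

def connected_components(img):
    """Return a list of components, each a set of (row, col) pixels."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if img[r][c] == 1 and not seen[r][c]:
                comp, q = set(), deque([(r, c)])
                seen[r][c] = True
                while q:
                    cr, cc = q.popleft()
                    comp.add((cr, cc))
                    for dr in (-1, 0, 1):
                        for dc in (-1, 0, 1):
                            nr, nc = cr + dr, cc + dc
                            if (0 <= nr < h and 0 <= nc < w
                                    and img[nr][nc] == 1 and not seen[nr][nc]):
                                seen[nr][nc] = True
                                q.append((nr, nc))
                comps.append(comp)
    return comps

img = [[1, 1, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 0, 1]]
print(len(connected_components(img)))  # 2
```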
A NOVEL APPROACH FOR WORD RETRIEVAL FROM DEVANAGARI DOCUMENT IMAGES (ijnlc)
A large amount of information lies dormant in historical documents and manuscripts, and it will be lost if not stored in digital form. Searching for relevant information in these scanned images would ideally require converting the document images to text via optical character recognition (OCR). For the indigenous scripts of India, there are very few OCRs that can successfully recognize printed text images of varying quality, size, style, and font. An alternative approach using word spotting can be effective for accessing large collections of document images. We propose a word-spotting technique based on codes for matching word images in Devanagari script: shape information is used to generate integer codes for the words in a document image, and these codes are matched for the final retrieval of relevant documents. The technique is illustrated using Marathi document images.
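The retrieval step described above can be sketched as follows. The integer codes and document index here are illustrative stand-ins, not the paper's actual shape-coding scheme; the point is only that once words are reduced to code sequences, matching can be done with a standard sequence distance.

```python
# A minimal sketch: rank indexed word images by edit distance between their
# integer shape-code sequences and a query's code sequence.

def edit_distance(a, b):
    """Levenshtein distance between two integer code sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

# Hypothetical index: document id -> shape-code sequence of a word image.
index = {"doc1": [3, 1, 4, 1, 5],
         "doc2": [2, 7, 1, 8, 2],
         "doc3": [3, 1, 4, 2, 5]}
query = [3, 1, 4, 1, 5]
ranked = sorted(index, key=lambda d: edit_distance(query, index[d]))
print(ranked[0])  # doc1
```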
The presentation describes an algorithm for recognizing Devanagari characters; Devanagari is the script in which Hindi is written. The algorithm automatically segments characters from an image of Devanagari text and then recognizes them. To extract individual characters, the image is segmented several times using vertical and horizontal projections: lines are first separated from the document using the horizontal projection, and each line is then split into words using its vertical projection. A further step particular to Devanagari is required: the header line is removed by taking the horizontal projection of each word, after which the characters can be extracted from the vertical projection of the word without the header line.
The algorithm uses a Kohonen neural network for the recognition task. After the characters are separated from the image, each character matrix is downsampled to a fixed size to make recognition size-independent. The matrix is then fed to the input neurons of the Kohonen network, and the winning neuron identifies the recognized character. This mapping is established during the training phase of the network: random weights are first assigned from the input to the output neurons, and for each training sample the winning neuron is the one producing the maximum output. The weights of the winning neuron are then adjusted so that it responds to that pattern more strongly the next time.
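The winner-take-all training just described can be sketched as follows. The network size, learning rate, and the two toy downsampled "characters" are illustrative assumptions, not values from the presentation.

```python
# A minimal sketch of Kohonen-style training: the winning output neuron is
# the one with the largest response, and only its weights are nudged toward
# the input pattern.
import numpy as np

rng = np.random.default_rng(1)
weights = rng.random((4, 16))          # 4 output neurons, 16 input pixels

def winner(x):
    """Index of the neuron with the maximum output for pattern x."""
    return int(np.argmax(weights @ x))

def train(x, lr=0.3):
    """Move the winning neuron's weights toward the input pattern."""
    w = winner(x)
    weights[w] += lr * (x - weights[w])
    return w

# Two distinct 4x4 downsampled "characters", flattened to 16-vectors.
a = np.array([1.0] * 8 + [0.0] * 8)
b = np.array([0.0] * 8 + [1.0] * 8)
for _ in range(20):
    train(a)
    train(b)
```

After a few passes, distinct patterns end up with distinct winning neurons, so the winner's index can be used as the recognized character's label.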
An Efficient Segmentation Technique for Machine Printed Devanagiri Script: Bo... (iosrjce)
Segmentation plays a major role in processing script documents for feature extraction, and many researchers are working to make the segmentation process both simple and efficient. This paper proposes a simple technique for both line and word segmentation of a script document. The main objective of the technique is to recognize the spaces that separate two text lines; a similar procedure is followed for word segmentation. In this work, three different scanned documents were taken as input images for both the line and word segmentation techniques. The results were outstanding, with 100% accuracy for both line and word segmentation. Evaluation results show that the method outperforms several competing methods.
Recognition of Words in Tamil Script Using Neural Network (IJERA Editor)
This paper proposes word recognition using a neural network. The recognition process starts by partitioning the document image into lines, words, and characters, and then capturing local features of the segmented characters. After the characters are classified, the word image is transformed into a unique code based on the character codes; this code can describe any form of word, including words with mixed styles and different sizes. The sequence of character codes of a word forms the input pattern, and the word code is the target value of the pattern. A neural network is trained on the word patterns; the trained network is then tested with word patterns, and a word is recognized or rejected based on the network's error value. Experiments conducted on a local database to evaluate the word-recognition system yielded good accuracy. The method can be applied to word recognition in any language, since training is based only on the unique codes of the characters and words belonging to that language.
Handwritten character recognition is one of the most challenging and active areas of research in the field of pattern recognition. HCR research is mature for languages such as Chinese and Japanese, but the problem is much more complex for Indian languages, and more complicated still for South Indian languages due to their large character sets and the presence of vowel modifiers and compound characters. This paper provides an overview of important contributions and advances in offline as well as online handwritten character recognition of Malayalam script.
Rule based algorithm for handwritten characters recognition (Randa Elanwar)
Presentation of a Master's dissertation.
Contents:
Rule-based Algorithm for Off-line Isolated Handwritten character recognition
Rule-based Algorithm for On-line Arabic Cursive Handwriting Segmentation and Recognition
HINDI NAMED ENTITY RECOGNITION BY AGGREGATING RULE BASED HEURISTICS AND HIDDE... (ijistjournal)
Named entity recognition (NER) is an application of Natural Language Processing and is regarded as a subtask of information retrieval. NER is the process of detecting Named Entities (NEs) in a document and categorizing them into named-entity classes such as organization, person, location, sport, river, city, country, and quantity. A lot of NER work has been accomplished for English, but comparatively little success has so far been achieved for the Indian languages. This paper discusses NER, the various approaches to it, performance metrics, the challenges of NER in the Indian languages, and finally some results achieved by performing NER in Hindi by aggregating approaches such as rule-based heuristics and a Hidden Markov Model (HMM).
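The HMM side of such an approach can be sketched as follows: Viterbi decoding of named-entity tags for a word sequence. The tiny tag set and hand-set probabilities are illustrative, not trained values from the paper, and the example sentence is hypothetical.

```python
# A minimal sketch of HMM-based NE tagging with Viterbi decoding.
# Toy tag set and hand-set probabilities; a real system estimates these
# from an annotated corpus.

tags = ["O", "PER", "LOC"]
start = {"O": 0.6, "PER": 0.2, "LOC": 0.2}
trans = {"O":   {"O": 0.8, "PER": 0.1, "LOC": 0.1},
         "PER": {"O": 0.6, "PER": 0.3, "LOC": 0.1},
         "LOC": {"O": 0.7, "PER": 0.1, "LOC": 0.2}}
emit = {"O":   {"went": 0.5, "to": 0.5},
        "PER": {"mohan": 0.9, "went": 0.05, "to": 0.05},
        "LOC": {"delhi": 0.9, "went": 0.05, "to": 0.05}}

def viterbi(words):
    """Most likely tag sequence for `words` under the toy HMM."""
    v = [{t: start[t] * emit[t].get(words[0], 1e-6) for t in tags}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for t in tags:
            best = max(tags, key=lambda p: v[-1][p] * trans[p][t])
            ptr[t] = best
            col[t] = v[-1][best] * trans[best][t] * emit[t].get(w, 1e-6)
        v.append(col)
        back.append(ptr)
    last = max(tags, key=lambda t: v[-1][t])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

print(viterbi(["mohan", "went", "to", "delhi"]))  # ['PER', 'O', 'O', 'LOC']
```

In the aggregated system, rule-based heuristics (gazetteers, suffix patterns) would typically constrain or override these statistical tag decisions.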
A Comprehensive Study On Handwritten Character Recognition System (iosrjce)
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
technique based on codes for matching the word images of Devanagari script. The shape information is utilised for generating integer codes for words in the document image and these codes are matched for final retrieval of relevant documents. The technique is illustrated using Marathi document images.
The presentation will describe an algorithm through which one can recognize Devanagari Characters. Devanagari is the script in which Hindi is represented. This algorithm
could automatically segment character from the image of Devenagari text and then recognize them.
For extracting the individual characters from the image of Devanagari text, algorithm segmented the image several
times using the vertical and horizontal projection.
The algorithm starts with first segmenting the lines separately from the document by taking horizontal projection and then the line
into words by taking vertical projection of the line. Another step which is particular to the separation of
Devanagari characters was required and was done by first removing the header line by finding horizontal projection
of each word. The characters can then be extracted by vertical projection of the word without the header line.
Algorithm uses a Kohonen Neural Netowrk for the recognition task. After the separation of the characters from the
image, the image matrix was then downsampled to bring it down to a fixed size so as to make the recognition
size independent. The matrix can then be fed as input neurons to the Kohonen Neural Network and the winning neuron is
found which identifies the recognized the character. This information in Kohonen Neural Network was stored
earlier during the training phase of the neural network. For this, we first assigned random weights from input neurons
to output neurons and then for each training set, the winning neuron was calculated by finding the maximum
output produced by the neurons. The wights for this winning neuron were then adjusted so that it responds to this
pattern more strongly the next time.
An Efficient Segmentation Technique for Machine Printed Devanagiri Script: Bo...iosrjce
Segmentation technique plays a major role in scripting the documents for extraction of various
features. Many researchers are doing various research works in this field to make the segmenting process
simple as well as efficient. In this paper a simple segmentation technique for both the line and word
segmentation of a script document has been proposed. The main objective of this technique is to recognize the
spaces that separate two text lines.For the Word segmentation technique also similar procedure is followed. In
this work ,three different scanned document have been taken as input images for both line and word
segmentation techniques. The results found were outstanding with average accuracy for both line and word. It
provides 100% accuracy for line segmentation and 100% for line segmentation as well. Evaluation results show
that our method outperforms several competing methods.
Recognition of Words in Tamil Script Using Neural NetworkIJERA Editor
In this paper, word recognition using neural network is proposed. Recognition process is started with the partitioning of document image into lines, words, and characters and then capturing the local features of segmented characters. After classifying the characters, the word image is transferred into unique code based on character code. This code ideally describes any form of word including word with mixed styles and different sizes. Sequence of character codes of the word form input pattern and word code is a target value of the pattern. Neural network is used to train the patterns of the words. Trained network is tested with word patterns and is recognized or unrecognized based on the network error value. Experiments have been conducted with a local database to evaluate the performance of the word recognizing system and obtained good accuracy. This method can be applied for any language word recognition system as the training is based on only unique code of the characters and words belonging to the language.
Handwritten character recognition is one of the most challenging and ongoing areas of research in the
field of pattern recognition. HCR research is matured for foreign languages like Chinese and Japanese but
the problem is much more complex for Indian languages. The problem becomes even more complicated for
South Indian languages due to its large character set and the presence of vowels modifiers and compound
characters. This paper provides an overview of important contributions and advances in offline as well as
online handwritten character recognition of Malayalam scripts.
Rule based algorithm for handwritten characters recognitionRanda Elanwar
Presentation of Master Dissertation
Content:
Rule-based Algorithm for Off-line Isolated Handwritten character recognition
Rule-based Algorithm for On-line Arabic Cursive Handwriting Segmentation and Recognition
HINDI NAMED ENTITY RECOGNITION BY AGGREGATING RULE BASED HEURISTICS AND HIDDE...ijistjournal
Named entity recognition (NER) is one of the applications of Natural Language Processing and is regarded as the subtask of information retrieval. NER is the process to detect Named Entities (NEs) in a document and to categorize them into certain Named entity classes such as the name of organization, person, location, sport, river, city, country, quantity etc. In English, we have accomplished lot of work related to NER. But, at present, still we have not been able to achieve much of the success pertaining to NER in the Indian languages. The following paper discusses about NER, the various approaches of NER, Performance Metrics, the challenges in NER in the Indian languages and finally some of the results that have been achieved by performing NER in Hindi by aggregating approaches such as Rule based heuristics and Hidden Markov Model (HMM).
A Comprehensive Study On Handwritten Character Recognition Systemiosrjce
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
A survey of named entity recognition in assamese and other indian languagesijnlc
Named Entity Recognition is always important when dealing with major Natural Language Processing
tasks such as information extraction, question-answering, machine translation, document summarization
etc so in this paper we put forward a survey of Named Entities in Indian Languages with particular
reference to Assamese. There are various rule-based and machine learning approaches available for
Named Entity Recognition. At the very first of the paper we give an idea of the available approaches for
Named Entity Recognition and then we discuss about the related research in this field. Assamese like other
Indian languages is agglutinative and suffers from lack of appropriate resources as Named Entity
Recognition requires large data sets, gazetteer list, dictionary etc and some useful feature like
capitalization as found in English cannot be found in Assamese. Apart from this we also describe some of
the issues faced in Assamese while doing Named Entity Recognition.
Optical Character Recognition System for Urdu (Naskh Font)Using Pattern Match...CSCJournals
The offline optical character recognition (OCR) for different languages has been developed over the recent years. Since 1965, the US postal service has been using this system for automating their services. The range of the applications under this area is increasing day by day, due to its utility in almost major areas of government as well as private sector. This technique has been very useful in making paper free environment in many major organizations as far as the backup of their previous file record is concerned. Our this system has been proposed for the Offline Character Recognition for Isolated Characters of Urdu language, as Urdu language forms words by combining Isolated Characters. Urdu is a cursive language, having connected characters making words. The major area of utility for Urdu OCR will be digitizing of a lot of literature related material already stocked in libraries. Urdu language is famous and spoken in more than 3 big countries including Pakistan, India and Bangladesh. A lot of work has been done in Urdu poetry and literature up to the recent century. Creation of OCR for Urdu language will make an important role in converting all those work from physical libraries to electronic libraries. Most of the stuff already placed on internet is in the form of images having text, which took a lot of space to transfer and even read online. So the need of an Urdu OCR is a must. The system is of training system type. It consists of the image preprocessing, line and character segmentation, creation of xml file for training purpose. While Recognition system includes taking xml file, the image to be recognized, segment it and creation of chain codes for character images and matching with already stored in xml file. The system has been implemented and it has 89% recognition accuracy with a 15 char/sec recognition rate.
Preprocessing Phase for Offline Arabic Handwritten Character RecognitionEditor IJCATR
—In this paper we reviewed the importance issues of the optical character recognition, gives more emphases for OCR and its phases. We discuss the main characteristics of Arabic language, furthermore it focused on the pre-processing phase of the character recognition system. We described and implemented the algorithms of binarization, dots removing and thinning which will be used for feature extraction phase. The algorithms are tested using 47,988 isolated character sample taken from SUST/ ALT dataset and achieved better results. The pre-processing phase developed by using MATLAB software
Handwriting character recognition (HCR) is the ability of a computer to receive and interpret handwritten input. Handwritten Character Recognition is one of the active and challenging research areas in the field of Pattern Recognition. Pattern recognition is a process that taking in raw data and making an action based on the category of the pattern. HCR is one of the well-known applications of pattern recognition. Handwriting recognition especially for Indian languages is still in infant stage because not much work has been done it. This paper discuss about an idea to recognize Kannada vowels using chain code features. Kannada is a South Indian language. For any recognition system, an important part is feature extraction. A proper feature extraction method can increase the recognition ratio. In this paper, a chain code based feature extraction method is investigated for developing HCR system. Chain code is working based on 4-neighborhood or 8–neighborhood methods. Chain code is a sequence of code directions of a character and connection to a starting point which is often used in image processing. In this paper, 8–neighborhood method has been implemented which allows generation of eight different codes for each character. These codes have been used as features of the character image, which have been later on used for training and testing for K-Nearest Neighbor (KNN) classifiers. The level of accuracy reached to 100%.
International Journal of Research in Engineering and Science is an open access peer-reviewed international forum for scientists involved in research to publish quality and refereed papers. Papers reporting original research or experimentally proved review work are welcome. Papers for publication are selected through peer review to ensure originality, relevance, and readability.
Wavelet Packet Based Features for Automatic Script IdentificationCSCJournals
In a multi script environment, an archive of documents having the text regions printed in different scripts is in practice. For automatic processing of such documents through Optical Character Recognition (OCR), it is necessary to identify different script regions of the document. In this paper, a novel texture-based approach is presented to identify the script type of the collection of documents printed in seven scripts, to categorize them for further processing. The South Indian documents printed in the seven scripts - Kannada, Tamil, Telugu, Malayalam, Urdu, Hindi and English are considered here. The document images are decomposed through the Wavelet Packet Decomposition using the Haar basis function up to level two. The texture features are extracted from the sub bands of the wavelet packet decomposition. The Shannon entropy value is computed for the set of sub bands and these entropy values are combined to use as the texture features. Experimentation conducted involved 2100 text images for learning and 1400 text images for testing. Script classification performance is analyzed using the K-nearest neighbor classifier. The average success rate is found to be 99.68%.
SIMILAR THESAURUS BASED ON ARABIC DOCUMENT: AN OVERVIEW AND COMPARISONIJCSEA Journal
The massive grow of the modern information retrieval system (IRS), especially in natural languages
becomes more difficult. The search in Arabic languages, as natural language, is not good enough yet. This
paper will try to build similar thesaurus based on Arabic language in two mechanisms, the first one is full
word mechanisms and the other is stemmed mechanisms, and then to compare between them.
The comparison made by this study proves that the similar thesaurus using stemmed mechanisms get more
better results than using traditional in the same mechanisms and similar thesaurus improved more the
recall and precision than traditional information retrieval system at recall and precision levels.
An effective approach to offline arabic handwriting recognitionijaia
Segmentation is the most challenging part of the Arabic handwriting recognition, due to the unique
characteristics of Arabic writing that allows the same shape to denote different characters. In this paper,
an off-line Arabic handwriting recognition system is proposed. The processing details are presented in
three main stages. Firstly, the image is skeletonized to one pixel thin. Secondly, transfer each diagonally
connected foreground pixel to the closest horizontal or vertical line. Finally, these orthogonal lines are
coded as vectors of unique integer numbers; each vector represents one letter of the word. In order to
evaluate the proposed techniques, the system has been tested on the IFN/ENIT database, and the
experimental results show that our method is superior to those methods currently available.
Off line system for the recognition of handwritten arabic charactercsandit
Recognition of handwritten Arabic text awaits accurate recognition solutions. There are many
difficulties facing a good handwritten Arabic recognition system such as unlimited variation in
human handwriting, similarities of distinct character shapes, and their position in the word. The
typical Optical Character Recognition (OCR) systems are based mainly on three stages,
preprocessing, features extraction and recognition.
In this paper, we present an efficient approach for the recognition of off-line Arabic handwritten
characters which is based on structural, Statistical and Morphological features from the main
body of the character and also from the secondary components. Evaluation of the accuracy of
the selected features is made. The system was trained and tested with CENPRMI dataset. The
proposed algorithm obtained promising results in terms of accuracy (success rate of 100% for
some letters at average 88%). In Comparable with other related works we find that our result is
the highest among others.
Technology selection for a given problem is often a tough ask. This is immensely useful comparative analysis betweeen Greenplum, Vectorwise and Amazon Redshift.
Steve Jobs has not only revolutionized high tech industry by introducing ground breaking products, but he has also showed us the way for managing organization and personal life.
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesChristina Lin
Traditionally, dealing with real-time data pipelines has involved significant overhead, even for straightforward tasks like data transformation or masking. However, in this talk, we’ll venture into the dynamic realm of WebAssembly (WASM) and discover how it can revolutionize the creation of stateless streaming pipelines within a Kafka (Redpanda) broker. These pipelines are adept at managing low-latency, high-data-volume scenarios.
Online aptitude test management system project report.pdfKamal Acharya
The purpose of on-line aptitude test system is to take online test in an efficient manner and no time wasting for checking the paper. The main objective of on-line aptitude test system is to efficiently evaluate the candidate thoroughly through a fully automated system that not only saves lot of time but also gives fast results. For students they give papers according to their convenience and time and there is no need of using extra thing like paper, pen etc. This can be used in educational institutions as well as in corporate world. Can be used anywhere any time as it is a web based application (user Location doesn’t matter). No restriction that examiner has to be present when the candidate takes the test.
Every time when lecturers/professors need to conduct examinations they have to sit down think about the questions and then create a whole new set of questions for each and every exam. In some cases the professor may want to give an open book online exam that is the student can take the exam any time anywhere, but the student might have to answer the questions in a limited time period. The professor may want to change the sequence of questions for every student. The problem that a student has is whenever a date for the exam is declared the student has to take it and there is no way he can take it at some other time. This project will create an interface for the examiner to create and store questions in a repository. It will also create an interface for the student to take examinations at his convenience and the questions and/or exams may be timed. Thereby creating an application which can be used by examiners and examinee’s simultaneously.
Examination System is very useful for Teachers/Professors. As in the teaching profession, you are responsible for writing question papers. In the conventional method, you write the question paper on paper, keep question papers separate from answers and all this information you have to keep in a locker to avoid unauthorized access. Using the Examination System you can create a question paper and everything will be written to a single exam file in encrypted format. You can set the General and Administrator password to avoid unauthorized access to your question paper. Every time you start the examination, the program shuffles all the questions and selects them randomly from the database, which reduces the chances of memorizing the questions.
HEAP SORT ILLUSTRATED WITH HEAPIFY, BUILD HEAP FOR DYNAMIC ARRAYS.
Heap sort is a comparison-based sorting technique based on Binary Heap data structure. It is similar to the selection sort where we first find the minimum element and place the minimum element at the beginning. Repeat the same process for the remaining elements.
Welcome to WIPAC Monthly the magazine brought to you by the LinkedIn Group Water Industry Process Automation & Control.
In this month's edition, along with this month's industry news to celebrate the 13 years since the group was created we have articles including
A case study of the used of Advanced Process Control at the Wastewater Treatment works at Lleida in Spain
A look back on an article on smart wastewater networks in order to see how the industry has measured up in the interim around the adoption of Digital Transformation in the Water Industry.
6th International Conference on Machine Learning & Applications (CMLA 2024)ClaraZara1
6th International Conference on Machine Learning & Applications (CMLA 2024) will provide an excellent international forum for sharing knowledge and results in theory, methodology and applications of on Machine Learning & Applications.
Water billing management system project report.pdfKamal Acharya
Our project entitled “Water Billing Management System” aims is to generate Water bill with all the charges and penalty. Manual system that is employed is extremely laborious and quite inadequate. It only makes the process more difficult and hard.
The aim of our project is to develop a system that is meant to partially computerize the work performed in the Water Board like generating monthly Water bill, record of consuming unit of water, store record of the customer and previous unpaid record.
We used HTML/PHP as front end and MYSQL as back end for developing our project. HTML is primarily a visual design environment. We can create a android application by designing the form and that make up the user interface. Adding android application code to the form and the objects such as buttons and text boxes on them and adding any required support code in additional modular.
MySQL is free open source database that facilitates the effective management of the databases by connecting them to the software. It is a stable ,reliable and the powerful solution with the advanced features and advantages which are as follows: Data Security.MySQL is free open source database that facilitates the effective management of the databases by connecting them to the software.
Student information management system project report ii.pdfKamal Acharya
Our project explains about the student management. This project mainly explains the various actions related to student details. This project shows some ease in adding, editing and deleting the student details. It also provides a less time consuming process for viewing, adding, editing and deleting the marks of the students.
Multitier holistic Approach for urdu Nastaliq Recognition
1. A Multi-tier Holistic approach for Urdu Nastaliq Recognition
Syed. Afaq Husain* and Syed. Hassan Amin**
Faculty of Computer Science and Engineering
Ghulam Ishaq Khan (GIK) Institute of Engineering Sciences and Technology
Topi, 23460, Dist. Swabi, NWFP, PAKISTAN
Email:* syed_a_h@giki.edu.pk_ , **shassan@giki.edu.pk
Abstract
Character recognition is an active area of research
with numerous applications including web publishing,
document analysis and text to speech conversion. In this
paper, we present a new approach for the off-line
recognition of cursive Urdu Text. This methodology has
been developed for the Noori Nastaliq Script [Ahmed 1].
Word (Ligature) based identification has been adopted
instead of character based identification. A multi-tier
holistic approach has been utilized to recognize ligatures
from a pre-defined ligature set. Initially, the special
ligatures (Dots, Tay, Hamza & Mad) are identified from
the base ligatures. These special ligatures are associated to
the most probable neighboring base ligature in the second
step. Finally, the above information along with some other
RTS invariant features of base ligature is presented to the
Feed Forward Back Propagation neural network to
perform the final recognition task.
Keywords: OCR, Urdu Character Recognition, Noori
Nastaliq, Ligature based identification, Back-propagation
Neural Network.
1. Objective
Urdu is the national language of Pakistan. It is a
language that is understood by over 300 million people
belonging to Pakistan, India and Bangladesh. Due to its
historical database of literature, there is a need to devise
automatic systems for conversion of this literature into
electronic form that may be accessible on the world-wide-
web. The suggested Urdu Text recognition system
endeavors to convert scanned Urdu documents
automatically into computerized text files in UZT format.
The Diacritics (Aerab) and punctuation have been
ignored in the current version of the system, however may
be classified as another category of symbols. Multi-Font
and multi-lingual support has also been ignored for
simplification.
2. Introduction
Urdu character set is based on the Arabic
character set. It is a cursive language even in its printed
form. In the past, a lot of research has been done on
automatic recognition of text written in languages based
on Roman [Guyon],[Ha], Chinese text [Guo],[Ding],
Arabic [Amin1] and Persian [Khorsheed3] but no serious
research has ever been published on Urdu text recognition.
Arabic and Persian, which are based on similar basic
characters and writing styles as Urdu, have seen quite
worthwhile research in the past decade. However, those
solutions are not valid to Urdu due to a number of inherent
differences in the script and styles of Urdu text. Nasakh
and Nastaliq are the two most popular writing styles
(scripts) in Urdu and both have their own unique features
that make them different and more complicated than their
close counterparts. The following chart (Table 1)
represents a view of the comparative complexities of Urdu
Script as compared to some other languages.
Like Arabic, recognizing Urdu script presents
challenges of cursive orthography and context sensitive
letter shape [Khorsheed2]. However, in contrast to Arabic
text, in which connected characters follows a base line, the
joined characters in Nastaliq and Nasakh are positioned
according to their preceding, pro-ceding as well as a
vertical justification of the ligature.
Table 1: Comparative features of some languages
The word recognition strategies are generally
classified into three categories, namely Holistic Approach,
Analytic Approach and Feature Sequence Matching.
[Shridher]. However, some researchers regard the
Sequence matching techniques to be a form of Holistic
approach. The analytic approach tries to segment the word
into characters before the recognition task while the
holistic approaches tries to recognize the word or its sub-
part (ligature) as a whole. [Khorsheed1]. The first
approach segment Urdu words into characters, and second
approach segment words into symbols. These symbols
may be character, ligature or possibly a fraction of
character.
In this paper, we present an approach to
recognize commonly used ligatures from Noori Nastaliq
Script developed by Ahmad Mirza Jamil [Ahmed1].
Nastaliq is one of the most beautiful and one of the most
complex scripts. The script was originally created by the
Characteristics Urdu Arabic Latin Hebrew Hindi
H Justification R L R L L R R L L R
V-Justification Centre Base No No Top
Cursive Yes Yes No No Yes
Diacritics Yes Yes No No Yes
# Vowels 2 2 5 11 -
# Letters 37 28 26 22 40
Letter Shapes 1-28 1-4 2 1 1
Complementary
Characters
5 3- - - -
2. calligrapher Mir Ali Tabrezi. The attempts to mechanize
Urdu script didn’t bear any success for a long time, and as
a result a typewriter that could type in the Nastaliq style, is
not available even today. There are two approaches to
computerizing Nastaliq i.e. Ligature based approach (more
glyphs) and character based approach (more rules). For
example, the word has three ligatures or separate
shapes , and . Noori Nastaliq describes about
20000 ligatures that are required to write almost all words
contained in the Urdu dictionary. Since, the ligature based
recognition is dependent on the ligatures used for training
it has the context information due to which it has a higher
performance. However, it has the disadvantage that adding
new ligatures into the system would require re-training of
the system. E.g. the. Urdu word Computer is one ligature
that is not in the formal dictionary of ligatures though it is
widely written in Urdu text.
3. Character Recognition Schemes
The problem of Urdu text recognition is closely
related to Arabic text recognition. Arabic Text
Recognition Systems generally have following stages:
image acquisition, preprocessing, segmentation, feature
extraction, classification and recognition [Khorsheed3].
The Arabic Text Recognition Systems are further
divided into Segmentation based and Segmentation-free
systems. Here we briefly describe approaches into Arabic
Text Recognition, with the view that these give valuable
insight into problem of Urdu Text Recognition [Bunke].
3.1 Segmentation Free Systems
In these systems, the word is recognized as a
whole without trying to segment and recognize characters
or primitives [7]. One approach for such systems is to
calculate a single feature vector for each word; this feature
vector is then used to recognize the word.
3.2 Segmentation Based Systems
In segmentation-based systems, each word is
divided into a number of subparts. These systems fall into
four categories: isolated/pre-segmented characters,
segmenting a word into characters, segmenting a word into
primitives, and integrated recognition and segmentation.
Such systems are either impractical, because they only
recognize digits and isolated characters, or they suffer a
low recognition rate because of segmentation errors
[Khorsheed2].
4. Ligature Identification System
In our proposed system, after preprocessing, the
text is segmented into a number of ligatures ordered from
right to left and top to bottom. A ligature at this stage is
defined as any connected set of characters. These ligatures
also include the special symbols used in Urdu, namely Tau,
Mad, Dots, Hamza and Ha. A number of features are
calculated and fed into a feed-forward backpropagation
neural network that separates special ligatures from base
ligatures. Each special ligature is then associated with a
base ligature, and the association contributes to the feature
vector of that base ligature, aiding its recognition. This
final feature vector is fed into a second feed-forward
backpropagation neural network that recognizes the base
ligatures.
Figure 1: Stages of Urdu Character Recognition
4.1 Preprocessing
The preprocessing stage involves smoothing,
skew detection and correction, document decomposition,
slant normalization, etc.
4.2 Segmentation
In document image analysis, four commonly used
segmentation algorithms are connected component
labeling, X-Y tree decomposition, run-length smearing,
and Hough Transform.
We apply connected component labeling to the
image of Urdu text. This technique assigns a distinct label
to each connected component of the binary image; the
labels are usually the natural numbers from 1 to the
number of connected components in the input image. The
algorithm scans the image from left to right and top to
bottom. On the first line containing black pixels, a unique
label is assigned to each contiguous run of black pixels.
For each subsequent black pixel, the pixels in its eight-
neighborhood are examined; if any of them has already
been labeled, the same label is assigned to the current
pixel, otherwise a new label is assigned. The procedure
continues to the bottom of the image [Khorsheed3].
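The two-pass flavor of this scheme can be sketched as follows (a minimal illustration, not the authors' implementation: the image is a 0/1 matrix, and label equivalences found during the scan are resolved with union-find):

```python
def label_components(img):
    """Two-pass connected-component labeling with 8-connectivity.

    img: 2D list of 0/1 values (1 = black pixel).
    Returns a 2D list of labels (0 = background) and the component count.
    """
    rows, cols = len(img), len(img[0])
    labels = [[0] * cols for _ in range(rows)]
    parent = {}  # union-find over provisional labels

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    next_label = 1
    # First pass: assign provisional labels, record equivalences.
    for r in range(rows):
        for c in range(cols):
            if not img[r][c]:
                continue
            # Labels already assigned among the neighbours scanned so far.
            neighbours = [labels[r + dr][c + dc]
                          for dr, dc in ((-1, -1), (-1, 0), (-1, 1), (0, -1))
                          if 0 <= r + dr and 0 <= c + dc < cols
                          and labels[r + dr][c + dc]]
            if neighbours:
                labels[r][c] = min(neighbours)
                for n in neighbours:
                    union(labels[r][c], n)
            else:
                labels[r][c] = next_label
                parent[next_label] = next_label
                next_label += 1
    # Second pass: replace provisional labels with canonical 1..N labels.
    canonical = {}
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                root = find(labels[r][c])
                if root not in canonical:
                    canonical[root] = len(canonical) + 1
                labels[r][c] = canonical[root]
    return labels, len(canonical)
```

Each resulting label then corresponds to one candidate ligature (base or special) for the later stages.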
4.3 Feature Extraction I
In this stage, we extract only those features that
help in the recognition of special ligatures, see figure.
These features are solidity, number of holes, axis ratio,
eccentricity, moments, normalized segment length,
curvature, and the ratio of bounding box width to height.
(Figure 1: Preprocessing → Segmentation → Feature Extraction I → Special Ligature Identification → Feature Extraction II → Ligature Identification)
4.3.1 Solidity
Solidity is a scalar quantity, defined as the
proportion of the pixels in the convex hull that are also in
the region. It is computed as
Solidity = Ligature Area / Convex Hull Area
where
Ligature Area = ∑∑ f(x, y) for all (x, y) in the binary image of the ligature,
Convex Hull Area = ∑∑ f(x, y) for all (x, y) in the convex hull of the ligature.
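As an illustration, solidity can be computed from the foreground pixel coordinates alone. This is a hedged sketch, not the authors' code: the convex hull is built over pixel centres with Andrew's monotone chain, and the hull area is counted with a point-in-polygon test; practical systems typically use a library routine.

```python
def solidity(points):
    """Solidity = ligature area / convex-hull area (the formula above).

    points: list of (x, y) foreground pixel coordinates.
    Ligature area is the foreground pixel count; hull area is the number
    of pixel centres inside the convex hull of the foreground pixels.
    """
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    pts = sorted(set(points))
    if len(pts) < 3:  # degenerate region: hull coincides with the region
        return 1.0
    # Andrew's monotone-chain convex hull (counter-clockwise order).
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    hull = lower[:-1] + upper[:-1]
    # Count pixel centres inside (or on) the hull polygon.
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    hull_pixels = sum(
        1
        for x in range(min(xs), max(xs) + 1)
        for y in range(min(ys), max(ys) + 1)
        if all(cross(hull[i], hull[(i + 1) % len(hull)], (x, y)) >= 0
               for i in range(len(hull)))
    )
    return len(pts) / hull_pixels
```

A fully convex region gives solidity 1; concave shapes such as a U-shaped stroke score lower.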
4.3.2 Axes Ratio
It is the ratio of the major axis to the minor axis
of the best-fit ellipse of the ligature:
Axis Ratio = a/b
where a and b are the lengths of the semi-major and
semi-minor axes of the best-fit ellipse.
4.3.3 Eccentricity
It is the ratio of the distance between the foci of the
best-fit ellipse to the length of its major axis:
Eccentricity = distance between foci / 2a
where 2a is the length of the major axis.
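One common way to obtain the best-fit ellipse (an assumption; the paper does not specify the fitting method) is from the second-order central moments of the foreground pixels, whose covariance eigenvalues give the squared semi-axes up to a common scale factor that cancels in both ratios:

```python
import math

def ellipse_features(points):
    """Axis ratio and eccentricity of the moment-based best-fit ellipse."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Central second-order moments (covariance of pixel coordinates).
    mxx = sum((x - cx) ** 2 for x, _ in points) / n
    myy = sum((y - cy) ** 2 for _, y in points) / n
    mxy = sum((x - cx) * (y - cy) for x, y in points) / n
    # Eigenvalues of the covariance matrix: larger -> major axis.
    common = math.sqrt((mxx - myy) ** 2 + 4 * mxy ** 2)
    l1 = (mxx + myy + common) / 2
    l2 = (mxx + myy - common) / 2
    a, b = math.sqrt(l1), math.sqrt(l2)
    axis_ratio = a / b                              # a/b, as above
    eccentricity = math.sqrt(1 - (b / a) ** 2)      # = 2c / 2a
    return axis_ratio, eccentricity
```

A circular blob yields axis ratio 1 and eccentricity 0; elongated strokes approach eccentricity 1.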
4.3.4 Moment based features
These refer to certain functions of moments that
are invariant to geometric transformations such as
translation, scaling, and rotation [6]. Such features are
useful for identifying objects with distinctive shapes
regardless of their location, size and orientation.
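For concreteness, here is a sketch of the first two Hu moment invariants, the classic translation-, scale- and rotation-invariant moment functions; this is an assumption, as the paper does not name the specific invariants it uses:

```python
def hu_invariants(points):
    """First two Hu moment invariants of a set of foreground pixels."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n

    def mu(p, q):  # central moment of order (p, q)
        return sum((x - cx) ** p * (y - cy) ** q for x, y in points)

    def eta(p, q):  # normalized central moment
        return mu(p, q) / mu(0, 0) ** (1 + (p + q) / 2)

    phi1 = eta(2, 0) + eta(0, 2)
    phi2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return phi1, phi2
```

Translating or rotating the point set leaves both values unchanged, which is exactly the property that makes them useful for shape recognition.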
4.3.5 Normalized Length Feature
First, the normalized length L(i) of segment i is
calculated relative to the lengths of the other segments in
the same word. The normalized length of the ligature is
then calculated as
Normalized Length = ∑ L(i)
4.3.6 Curvature Feature:
In a similar fashion, the curvature of a segment is
first measured by dividing the Euclidean distance between
the two feature points of that segment by its actual (traced)
length:
C(i) = (Euclidean distance between endpoints) / segment length
This value equals zero when the segment is a loop and 1
when the segment is a straight line. The curvature feature
of the ligature is then the sum of the curvature values of
all of its segments:
Curvature Feature = ∑ C(i)
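Given each segment's endpoints and traced length, both C(i) and the summed curvature feature follow directly. This is a minimal sketch; the tuple representation of a segment is an assumption for illustration:

```python
import math

def curvature_feature(segments):
    """Curvature feature of a ligature, as defined above.

    segments: list of ((x1, y1), (x2, y2), length) tuples, where length
    is the traced (arc) length between the segment's two feature points.
    C(i) = 0 for a loop (coincident endpoints), 1 for a straight line.
    """
    def c(seg):
        (x1, y1), (x2, y2), length = seg
        return math.hypot(x2 - x1, y2 - y1) / length

    return sum(c(s) for s in segments)
```

For example, a straight 3-4-5 segment contributes 1.0, while a closed loop of any length contributes 0.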
4.3.7 Number of Holes:
This feature gives the total number of holes in a ligature.
If the feature points of the ligature are considered as a set
of vertices V, and the segments as a set of edges E, of a
graph G(V, E), then the total number of holes in the
ligature can be found using graph theory as follows:
Number of Holes = E - Est
Here,
E = number of edges in G,
Est = number of edges in a spanning tree of G.
A connected graph with N vertices has N - 1 edges in its
spanning tree.
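The count E - Est can be computed with a union-find pass that builds a spanning forest; counting spanning edges directly also handles a ligature graph that happens not to be fully connected (a sketch, not the authors' code):

```python
def number_of_holes(num_vertices, edges):
    """Holes (independent cycles) in the ligature graph G(V, E).

    Equals E - Est, where Est is the number of edges in a spanning
    forest of G (V - 1 for a single connected component).
    """
    parent = list(range(num_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    spanning_edges = 0
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:          # edge joins two components: spanning edge
            parent[rv] = ru
            spanning_edges += 1
    return len(edges) - spanning_edges
```

A triangle (3 vertices, 3 edges) has one hole; any tree has none.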
4.4 Special Ligature Identification
For identifying special ligatures, a feed-forward
backpropagation neural network with 15 input, 25 hidden
and 25 output neurons is used. The feature vectors
obtained from the Feature Extraction I stage are fed to
this network, which classifies each ligature as either a
special ligature or a base ligature.
Figure 2: Some special ligatures
4.5 Feature Extraction II
In this stage, we associate special ligatures with
base ligatures. Each special ligature is associated with the
base ligature whose centroid-to-centroid distance is
minimum: a number of lines are grown from the centre of
each special ligature, and when one of these lines touches
a base ligature, the special ligature is associated with that
base ligature. This association adds twenty new features
to the feature vector of the base ligature.
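The centroid-to-centroid criterion can be sketched as a nearest-centroid search (illustrative only; the line-growing refinement described above is omitted):

```python
import math

def associate(special_centroids, base_centroids):
    """Map each special ligature to the index of the base ligature
    whose centroid is nearest (centroid-to-centroid criterion)."""
    def nearest(sc):
        return min(range(len(base_centroids)),
                   key=lambda i: math.dist(sc, base_centroids[i]))

    return [nearest(sc) for sc in special_centroids]
```

For instance, with base centroids at (0, 0) and (10, 0), a dot centred at (1, 1) is assigned to the first base ligature and one at (9, -1) to the second.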
4.6 Ligature Identification
In this stage, the final feature vector, consisting of
34 features, is fed into a feed-forward backpropagation
neural network. The network architecture consists of 34
input, 65 hidden and 45 output neurons.
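The forward pass of such a 34-65-45 network can be sketched with NumPy. Random weights stand in for the trained ones, and sigmoid activations are an assumption, since the paper does not state the activation function:

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the text: 34 inputs, 65 hidden, 45 output neurons.
W1 = rng.standard_normal((65, 34)) * 0.1
b1 = np.zeros(65)
W2 = rng.standard_normal((45, 65)) * 0.1
b2 = np.zeros(45)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(features):
    """Map a 34-feature vector to 45 output-class activations."""
    hidden = sigmoid(W1 @ features + b1)
    return sigmoid(W2 @ hidden + b2)

scores = forward(rng.standard_normal(34))
predicted_class = int(np.argmax(scores))  # index of the recognized ligature
```

In the actual system the weights would of course come from backpropagation training on the labeled ligature set.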
5. Results
The system was trained on a set of two hundred
carefully selected ligatures. Testing was done on bitmap
images of Urdu text rendered in Nastaliq font using a text
editor.
This simplified the problem by avoiding the
preprocessing otherwise required to remove noise
introduced during image acquisition. The training set
contained the simpler and more commonly used ligatures.
The performance of the system on images
containing only trained ligatures was 100%. However, in
cases where images contained additional ligatures, these
were classified to the closest match in the training set; no
rejection class was used.
6. Conclusion
In this paper, we have presented a method for
recognition of cursive Urdu text written in Nastaliq script.
The system is currently trained on a small number of
ligatures but can be expanded for practical use. Our
approach avoids segmentation errors by using a
segmentation-free approach. By using multiple classes of
features, we have increased the number of ligatures that
can be distinguished.
7. Future Directions
A number of directions are under consideration
for enhancing the system for practical use, namely:
1. Increasing the number of ligatures used for training.
2. Adding special characters, numerals and Aerab for
recognition as special ligatures.
3. Recognizing intonation marks in the document.
4. Adding multilingual support to the system.
References
1. [Ahmed] Ahmad Mirza Jamil, "Noori Nastaliq, Computerized Urdu Calligraphy", Elite Publishers, 1982.
2. [Amin] A. Amin and S. Al-Fedaghi, "Machine recognition of printed Arabic text utilizing a natural language morphology", Int. J. of Man-Machine Studies 35(6), 1991, 768-788.
3. [Badr] Badr Al-Badr, Robert M. Haralick, "Segmentation-free word recognition with application to Arabic", IJDAR 1(3):147-166, 1998.
4. [Bunke] H. Bunke, P. Wang, "Handbook of Character Recognition and Document Image Analysis", World Scientific, 2000.
5. [Ding] X. Q. Ding, Y. S. Wu, "Recognition of multi-font printed Chinese characters", CCIPP/CLCS, Toronto, Canada, 1988.
6. [Guo] H. Guo, X. Q. Ding, "The development of a high-performance Chinese/English bilingual OCR system", Proc. CMIN '95, Beijing, China, March 1995, 248-253.
7. [Guyon] I. Guyon, J. Bromley, N. Matic, et al., "A neural network system for recognizing on-line handwriting", Models of Neural Networks, Springer Verlag, 1996.
8. [Ha] J. Y. Ha, S. C. Oh, J. H. Kim, and Y. B. Kwon, "Unconstrained handwritten word recognition with interconnected hidden Markov models", 3rd Int. Workshop on Frontiers in Handwriting Recognition, Buffalo, May 1993, 455-460.
9. [Khorsheed1] Mohammad S. Khorsheed, William F. Clocksin, "Structural features of cursive Arabic script", Proc. of the 10th British Machine Vision Conference, University of Nottingham, UK, September 1999.
10. [Khorsheed2] M. S. Khorsheed, "Off-Line Arabic Character Recognition: A Review".
11. [Khorsheed3] Mohammad S. Khorsheed, "Automatic recognition of words in Arabic manuscripts", PhD Dissertation, Churchill College, University of Cambridge, June 2000.
12. [Shridher] N. Shridher, F. Kimura, "Segmentation-based cursive handwriting recognition", Handbook of Character Recognition and Document Image Analysis, World Scientific, 1997, 126-127.
13. [Trier] Øivind Due Trier, Anil K. Jain, and Torfinn Taxt, "Feature Extraction Methods for Character Recognition - A Survey", Pattern Recognition, Vol. 29, No. 4, pp. 641-662, 1996.