Optical Character Recognition for Urdu Handwriting and Ligatures using Deep Learning

OPTICAL CHARACTER RECOGNITION
FOR
URDU HANDWRITING
Presented by:
SAAD USMAN
USMAN Ali
YASIR HAYAT

INTRODUCTION:
■ Optical character recognition is the process of converting scanned images of
machine printed or handwritten text (numerals, letters and symbols), into a
computer process able format or machine-encoded text.

OCR
■ It is widely used as a form of information entry from printed paper data records,
whether passport documents, invoices, bank statements, computerized receipts,
business cards, mail, printouts of static-data, or any suitable documentation.
■ It is a common method of digitizing printed texts so that they can be electronically
edited, searched, stored more compactly, displayed on-line, and used in machine
processes
■ OCR is a field of research in pattern recognition, artificial intelligence and computer
vision.

EARLIER WORK
■ Early versions needed to be trained with images of each character, and worked on one
font at a time.
■ Advanced systems capable of producing a high degree of recognition accuracy for most
fonts are now common, and with support for a variety of digital image file format
inputs.
■ Some systems are capable of reproducing formatted output that closely approximates
the original page including images, columns, and other non-textual components.

1. Image Acquisition:
digital image acquisition is the creation of photographic images, such as of a physical scene or of the
interior structure of an object.
2. Preprocessing:
Preprocessing consists of text area extraction, text line extraction, baseline detection, component
segmentation, character segmentation, primary and secondary stroke extraction. It also involves noise
reduction, document decomposition etc.
3. Segmentation:
Segmentation is the process of dividing an image into regions , each region to containing a single object
or a group of objects of the same type.

4. Feature Extraction:
Selection of appropriate feature extraction, structured information which is more related to writing like dot,
loops and branches are computed.
5. Classification:
Classification is the process of identifying each character and assigning to it the correct character class.
6. Recognition:
Recognition is the last step in order to achieve our desired output, the extracted features of character is
then match with the stored features to recognize the character.

Over view of Urdu:
 Urdu derived from the mixture of Arabic, Turkish, Farsi and Hindi Languages with 58 character set
defined by National Language Authority Pakistan.
 Each letter has multiple forms depending on its position in the word.
 four forms: isolated , initial , medial and final.
 Urdu characters can be divided into two groups, separators and non-separators .
 The separators or non-joiners can acquire only isolated and final shape. On contrary non-separators
or joiners can acquire all the four shapes

Components and variation in writing:

RECOGNITION OF OFFLINE
HANDWRITTEN ISOLATED URDU
CHARACTER

System Methodology used:
■ Data set:
■ proposed method is applied on 36800 handwritten Urdu characters. For each of 46
characters 200 image samples were used for training and 600 for testing respectively .
■ Methodology:
■ Moment Invariants (MI) are used to evaluate seven distributed parameters for
handwritten isolated Urdu character.

■ Initially verified whether character consists of single component or more than one component.
■ If single component then image is normalized into 60 X 60 and divided into 3 horizontal zones for
features extraction
■ From each zone 7 MI features and from whole image 7 MI features were computed, hence total 28 MI
features.
■ SVM is used for classification
■ character is put into appropriate class as single stork component character.
■ secondary component is normalized into 22 X 22 and divided into 2 horizontal zones

Results:
Achieved overall accuracy up to 93.59% for
all offline handwritten isolated Urdu characters

OPTICAL CHARACTER RECOGNITION SYSTEM
FOR URDU. ONLINE AND OFFLINE OCR
IRRESPECTIVE OF FONTS.

■ Urdu language is also one of the languages which contain the features, properties, scripts and writing
styles of duo languages Arabic and Persian.
■ Urdu script is blend of Naskh, Arabic style and Talique, Persian style called Nastalique script.
■ It is cursive in nature, which makes it more difficult for conventional algorithms to work on it.
■ Online character recognition is a process which is used for handwriting recognition, example
digital pens.
■ Offline character recognition can have both handwriting and printed material. Offline character
recognition mostly used for printed papers, book etc.
■ We have used segmentation free approach in which only ligatures are segmented.

Compound Component Extraction:
■ The whole compound ligature along with its primary and secondary stroke is called compound
component.
■ Each connected component is inspected and is formed according to the following rules:
1. If area of a connected component reside over another connected component than both are the parts of one
compound component.
2. If area of a connected component reside under another connected component than both are the part of one
compound component.
3. If area of a connected component reside over or under another connected component more than 50% than it is
a part of one compound component.
4. If area of a connected component does not reside over or under another connected component than
component it self is compound connected component

Base Line Detection:
■ Base line is the horizontal point where maximum black pixels are present.
■ Connected component that lies on the base line are primary strokes and others are
secondary strokes.
■ average horizontal lines are drawn on 50% and 35% of the height of compound component.

Stroke Identification:
■ Primary and secondary strokes are identified on the basis of base line and average
horizontal lines according to the following rules:
1. Strokes that lie on base line and one of the average horizontal lines are primary strokes.
2. Strokes that lie on both average horizontal lines are primary strokes.
3. Strokes that do not lie on any line are secondary strokes.
4. Strokes that lie only on one average horizontal line are secondary strokes
5. If one stroke lies on base line and other stroke do not lie on any line than base line stroke
is primary and other is secondary.
■ For handwriting we have developed stroke identification improvement algorithm.

Features Extraction:
■ We computed five features for single character or ligature. First of all image is resize into 64x64
pixels after that features are extracted.
1. 8x64 pixels: 8x64 pixel window move from right to left and compute the ratio between white
and black pixels.
2. 64x8 pixels:64x8 pixels window move from top to bottom.
3. 8x8 pixels:8x8 pixels window moves from top right to bottom left.
4. Square shape: Square shape method first read 2x2 pixels from right top to bottom left and if
any black pixel is found than whole 2x2 pixel are converted in black pixels.
5. Hu invariant moments are calculated for character and ligatures.

Recognition:
■ the extracted features are then match with the stored features to recognize the character.
■ We have used K-Nearest Neighbors (KNN) algorithm for features matching.
■ In KNN we have applied Euclidean distance with 10 nearest neighbors.
■ If matched five features independently by KNN, then the maximum same result given by independent
result is the final recognition result.

Results and conclusion:
Developed the proposed OCR system on MATLAB and Microsoft C#.Net.
■ system gives 97.09% accuracy in extracting text lines.
■ Accuracy of 98.86% found in primary and secondary stroke extraction.
■ Recognition gives accuracy of 97.12%.
Proposed a system for both online and offline Urdu OCR system.

CLASSIFICATION OF URDU LIGATURES USING
CONVOLUTIONAL
NEURAL NETWORKS – A NOVELAPPROACH

■ Ligature: word or a sub-word that is a combination of (one
to eight) connected characters.
■ Why Ligature ?
■ Why Neural Network ?

 Data set: 55,000 Urdu ligatures and are extracted
from scanned pages of a famous Urdu book (‘Zawiya’).
 Training: Trained on dataset of 38000 ligature/552 Classes where as
Previous work was on Approx. 10000 ligatures.
 Accuracy: Beat the state of art (93.59%)

Preprocessing:
 Each image is binarized(global thresholding ) and resized(55x55).
 The ligature is then copied at the top left corner of the standard 55x55 image.

Proposed Architecture:
 We have used six convolutional layers stacked with each other
with pooling layers after every two convolutional layers.
 This arrangement yield maximum accuracy
 Finally, there is a single fully connected layer which computes
the class scores

PERFORMANCE COMPARISON OF DIFFERENT CNN
ARCHITECTURES (ACCURACY)

Improvement:
 For High Accuracy more training data
 The realized results show that deeper the network and
smaller the kernel size, better are the recognition rates.
 Addition of Special characters, Numerals and recognition as special ligatures.

SEGMENTATION BASED URDU
NASTALIQUE OCR

Classification:
■ Some letters belong to multiple classes because they contain isolated and
final forms different from initial and medial forms.

End Point:
■ Skeletonized image is then segmented after determining the ending point of the
ligature.
■ In Nastalique it is very difficult to determine the exact starting point of the
ligature so instead of that we start with the ending point of the ligature which is
more deterministic

■ Sixty segments were extracted from all shapes.

Segmentation of the Ligatures:

Results:
 A total of 1692 ligatures, which are formed from the six base forms.
 Got accuracy of 92.73%
 The Urdu words were written in font Noori Nastalique and font size was 36.

Problems:
 Some letters were not recognized correctly may be due to scanning,binarization.

Future/Improvements:
 Increasing the number of training samples so there will be less
chances of error.
 As in previous slide problem by giving original segment as
HMM input instead of giving skeletonized segment.
 Can be extended to cover all set of Urdu letters and for different
font size.

Reference:
■ S. Sardar and A. Wahab, "Optical character recognition system for Urdu," 2010 International
Conference on Information and Emerging Technologies, Karachi, 2010, pp. 1-5.
■ N. Javed, S. Shabbir, I. Siddiqi and K. Khurshid, "Classification of Urdu Ligatures Using Convolutional
Neural Networks - A Novel Approach," 2017 International Conference on Frontiers of Information
Technology (FIT), Islamabad, 2017, pp. 93-97.
■ Pathan, Imran & Ramteke, Rakesh. (2012). Recognition of Offline Handwritten Isolated Urdu
Character. Advances in Computational Research. 4. 117-121.
■ Javed S.T., Hussain S. (2013) Segmentation Based Urdu Nastalique OCR. In: Ruiz-Shulcloper J., Sanniti
di Baja G. (eds) Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications.
CIARP 2013. Lecture Notes in Computer Science, vol 8259. Springer, Ber.lin, Heidelberg
■ Z. Ahmad, J. K. Orakzai and I. Shamsher, "Urdu compound Character Recognition using feed forward
neural networks," 2009 2nd IEEE International Conference on Computer Science and Information
Technology, Beijing, 2009, pp. 457-462.
■ Naz, Saeeda, Arif Iqbal Umar, Riaz Ahmad, Saad Bin Ahmed, Syed Hamad Shirazi, Imran Siddiqi and
Muhammad Imran Razzak. “Offline cursive Urdu-Nastaliq script recognition using multidimensional
recurrent neural networks.” Neurocomputing 177 (2016): 228-241.

Optical Character Recognition for Urdu Handwriting and Ligatures using Deep Learning

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Optical Character Recognition for Urdu Handwriting and Ligatures using Deep Learning

Similar to Optical Character Recognition for Urdu Handwriting and Ligatures using Deep Learning (20)

Recently uploaded

Recently uploaded (20)

Optical Character Recognition for Urdu Handwriting and Ligatures using Deep Learning