Bangla Handwritten Digit Recognition Methods CNN

Bangladesh Army University of Science and Technology
(BAUST)
Department of Computer Science and Engineering
Assignment #1, Winter 2023 Level-4 Term-I
Course Code: CSE 4131 Course Title: Artificial Neural Networks and Fuzzy
Systems
Submission Date: CO Number: CO2 Full Marks: 15
ID: 200101103 Name: Khondoker Abu Naim
Bangla Handwritten Digit Recognition
1. Introduction
Bangla handwritten digit recognition is a classical problem in the field of computer vision. It has
many practical applications, such as OCR, postal code recognition, license plate recognition, and
bank check recognition, and recognizing Bangla digits in documents is becoming more important.
Bangla has 10 unique digits, so the recognition task is to classify 10 different classes. The
critical difficulty in handwritten digit recognition is that every person has their own writing
style, so the same digit can look very different from one writer to another. Our contribution
addresses a more challenging task: getting robust performance and high accuracy on the large,
unbiased, unprocessed, and highly augmented "bangla-digit" dataset. The dataset is a combination
of ten-class datasets that were gathered from different sources and at different times, containing
blurring, noise, rotation, translation, shear, zooming, height/width shift, brightness, contrast,
occlusions, and superimposition. We have not processed all kinds of augmentation in this dataset;
we have mainly processed blurred and noisy images. The processed images are then classified by a
deep convolutional neural network (CNN).
2. Literature Review
2.1 Method 1
Proposed Method:
The purpose of OCR is to recognize and identify characters in images of text documents and map
them to computer-readable character codes that can be used for further text processing. A typical
workflow for recognizing characters from image documents is shown in FIG. This includes the
following steps:
1) Preprocessing: The input image goes through a series of preprocessing steps. The purpose of
preprocessing is to allow the OCR engine to work with greater accuracy. This can be achieved
through a series of operations.
a) Binarization: The document image is thresholded to convert the grayscale image to a
binary image. Image thresholding can be global or local (adaptive). Global image
thresholding uses only one threshold for the entire image, whereas local (adaptive)
thresholding uses different thresholds for different image segments according to local
information.

b) Noise Reduction: Noise reduction improves image quality. Two approaches are commonly
taken for noise reduction: 1) image filtering, such as the Wiener filter, Gaussian filter, and
median filter, and 2) morphological operations, such as erosion and dilation.
c) Normalization: Normalizing inter-user and intra-user variability due to character size or
choice of font family such as boldface is always a good idea. Common normalization steps
include stroke width normalization or thinning, and normalization of aspect ratio and size of
the image.
d) Skew correction: Skew correction methods are employed in order to align the image
document. Major approaches for skew detection include correlation, projection profiles, and
Hough transform.
e) Slant removal: The slant of handwritten text is user dependent. Slant removal methods are
used to reduce variability due to different typefaces and normalize all characters to a
canonical form.
2) Segmentation: The purpose of image segmentation in OCR systems is to extract isolated
characters from image documents. The segmentation step includes the following operations: text line
detection, word extraction, and character segmentation. The segmentation of identified characters is
usually performed in a top-down manner: line segmentation is performed first, then word
segmentation, then character segmentation.
3) Feature Extraction: In the feature extraction step, the segmented characters are transformed into a
set of features called feature vectors. Each character is represented by its feature vector. Feature
extraction provides dimensionality reduction and extracts relevant information from character images
to facilitate better separation and identification of different characters in feature space.
4) Classification: Classification schemes provide decision rules for identifying characters based on
feature vectors. This task can be accomplished by leveraging machine learning approaches such as
Artificial Neural Networks (ANN), K-Nearest Neighbors (KNN), Hidden Markov Models (HMM),
Support Vector Machines (SVM), and other standard classifiers.
5) Post-processing: Dictionary-based approaches and context can be used to improve recognition
rates, for example by correcting spelling errors and selecting valid words.
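The difference between global and local (adaptive) thresholding in the binarization step can be sketched as follows. This is a minimal illustration and not the implementation of the reviewed paper: the function names, the fixed threshold of 128, and the 3x3 neighborhood are assumptions chosen only for the example.

```python
import numpy as np

def global_threshold(gray, thresh=128):
    """Binarize with one threshold for the whole image (thresh is an assumed value)."""
    return (gray >= thresh).astype(np.uint8)

def adaptive_mean_threshold(gray, block=3, offset=0.0):
    """Binarize each pixel against the mean of its block x block neighborhood.

    block must be odd; the neighborhood size is an assumption for this sketch.
    """
    pad = block // 2
    padded = np.pad(gray.astype(float), pad, mode="edge")
    height, width = gray.shape
    local_sum = np.zeros((height, width), dtype=float)
    # Adding shifted copies of the padded image gives the neighborhood sum.
    for dy in range(block):
        for dx in range(block):
            local_sum += padded[dy:dy + height, dx:dx + width]
    local_mean = local_sum / (block * block)
    return (gray > local_mean - offset).astype(np.uint8)

# Hypothetical 2x2 image: dark left column, bright right column.
image = np.array([[10, 200], [10, 200]])
print(global_threshold(image))
print(adaptive_mean_threshold(image))
```

The global version uses the same cut-off everywhere, while the adaptive version compares each pixel with its own surroundings, which is why it copes better with uneven lighting.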
Result: The recognition accuracy for different feature sets depends on the dimension of the zone.
For the original 32×32 image (without zoning), the recognition accuracy was rather poor at 78.5%.
However, applying zoning to the character image significantly improves the recognition accuracy:
16×16 zoning gives a recognition accuracy of 86.5%, and 8×8 zoning gives the best accuracy of
94.0%. If the dimensionality of the zone is further reduced, the recognition accuracy drops. For
example, 4×4 zoning gives an accuracy of 89.2%. This result reflects the fact that while zoning
can help reduce feature dimensionality, as discussed in Section III, excessive feature reduction
can reduce recognition accuracy. For the feature set with the best performance (8×8 zones), the
recognition accuracy was also calculated for each digit separately. This shows that digits such as
4 and 8 have very high accuracy, while other digits such as 3 and 5 have relatively poor
recognition accuracy.
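To make the effect of zoning on feature dimensionality concrete, here is a small sketch. It assumes that "N×N zoning" means splitting the 32×32 image into an N×N grid of equal zones and using the mean pixel intensity of each zone as one feature; the exact zone feature used in the reviewed paper is not stated here, so this is only an illustration.

```python
import numpy as np

def zoning_features(image, grid):
    """Split a square image into grid x grid zones and return each zone's mean intensity."""
    size = image.shape[0]
    if image.shape[0] != image.shape[1] or size % grid != 0:
        raise ValueError("image must be square and its side divisible by grid")
    zone = size // grid
    # Reshape to (grid, zone, grid, zone) so every zone can be averaged at once.
    blocks = image.reshape(grid, zone, grid, zone)
    return blocks.mean(axis=(1, 3)).ravel()

# Hypothetical 32x32 image with ink only in the top-left quadrant.
image = np.zeros((32, 32))
image[:16, :16] = 1.0

for grid in (16, 8, 4):
    features = zoning_features(image, grid)
    print(grid, "x", grid, "zoning ->", features.size, "features")
```

Under this assumption the raw image has 1024 values, while 16×16, 8×8, and 4×4 zoning give 256, 64, and 16 features, which matches the trade-off described above: fewer features help up to a point, then too much detail is lost.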
Limitation: For Bangla numeral recognition, the method demonstrates an excellent result with 94%
overall accuracy. This result is very promising, and is likely to improve if pre-processing techniques
such as normalization, skew correction, and slant removal are applied. Further improvement may be
achieved with the use of appropriate features specific to the Bangla digits, and different variants of
the sparse representation classifier (SRC) such as regularized SRC and kernel SRC. Comparison with
other conventional classifiers should be considered in future as a continuation of this work. The
results should also be verified on other standard handwritten character databases such as the ISI
database of handwritten Bangla numerals.
2.2 Method 2
Proposed Method: The CNN model proposed here, called "MathNET", has several phases, as
described below.
Dataset: The dataset for this CNN model combines 6,000 digit images (0-9) from 'Ekush' [8] with a
total of 26,400 images of 44 other classes of mathematical symbols. These 44 handwritten symbols
were collected from 500 students. Image clarity depends on character size. Each image has only a
small amount of black background padding, and the text is white. The images in this data set have
an undistorted size of 28 x 28 pixels, and the edges of the images appear blurred. The two datasets
are then concatenated to get a final dataset with a total of 32,400 images.
Preparation of dataset: In deep learning, variety of data inside the dataset is very important. The
images were resized to 28x28 px, the unnecessary black pixels were removed, and the whole dataset
was converted to csv format for fast computation. The whole data set has 785 columns in every row:
28 x 28 = 784 columns contain the pixel values that represent the image, and the 785th column
stores the label or class of the digit or symbol.
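The 785-column row layout can be sketched as below. The helper names are hypothetical and only illustrate how a 28 x 28 image and its label fit into one row; the reviewed paper's actual conversion script is not available here.

```python
import numpy as np

def image_to_row(image, label):
    """Flatten a 28x28 image into 784 pixel values and append the class label."""
    pixels = np.asarray(image, dtype=np.int64).reshape(-1)
    if pixels.size != 28 * 28:
        raise ValueError("expected a 28x28 image")
    return np.concatenate([pixels, [label]])

def row_to_image(row):
    """Recover the 28x28 image and the label from one 785-value row."""
    row = np.asarray(row)
    return row[:784].reshape(28, 28), int(row[784])

# Hypothetical image: a white vertical stroke on a black background, class 1.
image = np.zeros((28, 28), dtype=np.uint8)
image[4:24, 13:15] = 255
row = image_to_row(image, label=1)
print(row.shape)
```

Storing one image per row this way lets the whole dataset be loaded as a single table, with the last column split off as the training target.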
This model has Maxpool layers and a fully connected Dense layer, and it uses Dropout [9] as the
regularization method. The first two convolutional layers have 32 filters each with kernel_size
(5,5) and use the ReLU activation function with padding = `same`. The output of dropout_1 goes
into layers conv2d_3 and conv2d_4 as input. The max_pooling2d_2 layer takes its input from
conv2d_3 and conv2d_4 and passes its output to the 25% dropout_2 layer. After these 8 operations,
the output goes through the flatten_1 layer and is connected to a dense_1 layer with 256 hidden
units.
The MathNET model uses the RMSprop [10] [11] optimizer with the learning rate value set to 0.
The RMSprop optimizer is similar to the gradient descent algorithm with momentum. It limits
the oscillations in the vertical direction. As a result, a larger learning rate can be used, and
the algorithm can take bigger steps in the horizontal direction and converge more rapidly.
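The way RMSprop damps oscillations can be seen from its update rule, sketched below. RMSprop divides each gradient component by a running root-mean-square of that component's recent values. The learning rate of 0.001 and the decay rate of 0.9 are assumptions for this example, since the value used in the reviewed paper is not legible in the text above.

```python
import numpy as np

def rmsprop_step(w, grad, cache, lr=0.001, rho=0.9, eps=1e-8):
    """One RMSprop update; lr and rho are assumed values, not taken from the paper."""
    # Running average of squared gradients, kept per parameter.
    cache = rho * cache + (1.0 - rho) * grad ** 2
    # Steep directions have a large cache, so their step is scaled down.
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

# Hypothetical loss that is 100 times steeper in the first direction.
w = np.array([1.0, 1.0])
grad = np.array([100.0, 1.0]) * w
new_w, cache = rmsprop_step(w, grad, np.zeros(2))
print("step taken:", w - new_w)
```

Although the first gradient component is 100 times larger, both parameters move by almost the same amount, which is the damping of the steep ("vertical") direction described above.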
A CNN model works better when it sees a lot of data during training. This is where data
augmentation helps: it generates artificial data to avoid overfitting of the model. Several
augmentation methods were chosen: zoom_range set to 0.1, random horizontal shifts of 0.1, and
random vertical shifts of 0.1.
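The random shift part of this augmentation can be sketched in plain NumPy as follows. This is a simplified illustration: it moves the image by whole pixels and fills the uncovered area with black, which may differ from how the library used in the paper fills the border, and the zoom step is left out. The function names are hypothetical.

```python
import numpy as np

def shift_image(image, dy, dx):
    """Shift a 2D image by whole pixels, filling the uncovered area with zeros."""
    h, w = image.shape
    out = np.zeros_like(image)
    src_y, dst_y = slice(max(0, -dy), min(h, h - dy)), slice(max(0, dy), min(h, h + dy))
    src_x, dst_x = slice(max(0, -dx), min(w, w - dx)), slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = image[src_y, src_x]
    return out

def random_shift(image, rng, fraction=0.1):
    """Shift by up to `fraction` of the image size in each direction."""
    h, w = image.shape
    max_dy, max_dx = int(fraction * h), int(fraction * w)
    dy = int(rng.integers(-max_dy, max_dy + 1))
    dx = int(rng.integers(-max_dx, max_dx + 1))
    return shift_image(image, dy, dx)

rng = np.random.default_rng(0)
image = np.zeros((28, 28))
image[10:18, 12:16] = 1.0  # hypothetical digit stroke
augmented = [random_shift(image, rng) for _ in range(4)]
print(len(augmented), augmented[0].shape)
```

Each call produces a slightly displaced copy of the same digit, so the network sees more positional variety than the original data contains.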
Result:
Limitation: From the errors found on the given test set, it can be stated that MathNET
successfully recognizes 97% of the images in the test data. Fig. 4 shows the top 6 errors. These
happened because of wrongly labeled data in the test set. Some of the errors are also confusing to
us, so the same mistakes could be made by a human.

2.3 Method 3
Proposed Method: The digit recognition process is mainly divided into three main parts:
preprocessing, feature extraction, and classification.
Preprocessing: The steps performed before feature extraction are called preprocessing. The purpose
of preprocessing is to improve the image data to suppress unwanted distortions or to enhance
important image features for further processing. Preprocessing steps include image acquisition,
binarization, denoising, skew detection, segmentation, and scaling.
1) Image Capture: Any device with a camera or scanner can capture images [4]. Images from PDF
files can also be imported into the system. The image is a single digit or a series of digits
collected from license plates, bank checks, zip codes, etc.
2) Image binarization: RGB images are converted to grayscale before binarization. Binarization
is performed with a global threshold computed by Otsu's thresholding method.
3) Denoising: Denoising is performed to reduce the possibility of misclassification due to poor
image quality. Here a median filter is used for noise reduction [6]. This is a commonly used
smoothing method.
4) Skew Detection: Skew is usually caused by the image being placed at an angle when it is
captured. Skew is usually removed by rotating the image by an angle opposite to the estimated skew
value.
5) Segmentation: Segmentation includes line segmentation and digit segmentation. Line segmentation
is performed by scanning the image horizontally and counting the white pixels in each row. Next,
digit segmentation is performed by scanning each line vertically; gaps between digits are detected,
and the subimages are saved.
6) Scaling: To compare feature vectors, all digits must be scaled to a certain size. As the size of
the image increases, more features can be extracted, thus increasing the accuracy, but the memory
requirements and the time taken also increase. In contrast, a smaller image has fewer
features, resulting in lower accuracy. All images are scaled to 32x32 matrices to balance the feature
size and processing time.
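The line and digit segmentation step described in the list above can be sketched with projection profiles. The sketch assumes a binary image in which ink pixels are 1 (white) and the background is 0, and it treats any row or column without ink as a gap; the function names are hypothetical and the reviewed paper may handle noise and touching digits differently.

```python
import numpy as np

def runs_of_ink(profile):
    """Return (start, end) pairs of consecutive non-zero entries; end is exclusive."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs

def segment_digits(binary):
    """Split a binary page into digit sub-images: lines first, then digits in each line."""
    digits = []
    for top, bottom in runs_of_ink(binary.sum(axis=1)):     # rows containing ink form a line
        line = binary[top:bottom]
        for left, right in runs_of_ink(line.sum(axis=0)):   # columns containing ink form a digit
            digits.append(line[:, left:right])
    return digits

# Hypothetical page: two "digits" on the first line, one on the second.
page = np.zeros((7, 9), dtype=np.uint8)
page[1:3, 1:3] = 1
page[1:3, 5:7] = 1
page[4:6, 2:4] = 1
print(len(segment_digits(page)))
```

Each returned sub-image can then be scaled to 32x32 before feature extraction, as in step 6.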
Result: The results, summarized in Table 3, show that for a very large number of training features
linear SVM works very efficiently. But if the feature size is smaller than the number of
observations, then an RBF or polynomial kernel is preferred because it fits this kind of data set
properly, resulting in higher accuracy than linear SVM.
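For reference, the three SVM kernels being compared can be written as short functions. These are the standard textbook definitions; the degree, gamma, and coef0 values are assumptions for the example and are not the settings used in the reviewed paper.

```python
import numpy as np

def linear_kernel(x, y):
    """K(x, y) = x . y"""
    return float(np.dot(x, y))

def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """K(x, y) = (gamma * x . y + coef0) ** degree, with assumed parameter values."""
    return float((gamma * np.dot(x, y) + coef0) ** degree)

def rbf_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2), with an assumed gamma."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

# Hypothetical 2-dimensional feature vectors.
x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])
print(linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y))
```

The linear kernel only measures the dot product, while the polynomial and RBF kernels implicitly map the features to a higher-dimensional space, which is why they can fit data that a linear boundary cannot separate.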
Limitation: In this paper, the comparative performance of three well-known kernels of the SVM
classification algorithm has been investigated to find the appropriate kernel function for the
sample dataset of Bangla handwritten digits used. Experimental results show that, using HOG
features, handwritten digit recognition reaches at most 97.08% accuracy with the polynomial kernel
function. This performance mostly depends on the preprocessing and feature extraction techniques.
However, the recognition rate can be improved by combining more than one feature extraction
technique.
2.4 Method 4
Proposed Method:
A. Dataset preparation and image preprocessing: In this study we've trained our model with the
recently developed large dataset NumtaDB, which consists of 85000+ images, and initially trained
with 72040 specimens from the dataset. Before feeding data into the model we've done some image
preprocessing to remove unnecessary features and artifacts as much as possible so that training is
efficient. At first we've converted the images from RGB to grayscale and then reshaped them to
64x64x1 so that all training data have the same dimensions. Then we've applied a Gaussian blur
to the image with a standard deviation of 10. After that, the blurred images have been
blended with the grayscale images again using cv2. 5 for blurred image. Preprocessing has been
applied to all train and test images.
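The blur-and-blend preprocessing can be sketched in plain NumPy as below. The blending weights are not recoverable from the text above, so the values 1.5 for the grayscale image and -0.5 for the blurred image are purely assumptions for this example; with a negative weight on the blurred copy, the blend is the common unsharp masking technique. The function names are hypothetical, and the paper itself used cv2 rather than this code.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1D Gaussian kernel covering three standard deviations on each side."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()

def gaussian_blur(image, sigma):
    """Separable Gaussian blur with edge padding; output has the input's shape."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    padded = np.pad(image.astype(float), radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

def blend(gray, blurred, w_gray=1.5, w_blur=-0.5):
    """Weighted sum of the two images; both weights are assumed values."""
    return np.clip(w_gray * gray + w_blur * blurred, 0, 255)

# Hypothetical 64x64 grayscale image with a bright square in the middle.
gray = np.zeros((64, 64))
gray[24:40, 24:40] = 200.0
sharpened = blend(gray, gaussian_blur(gray, sigma=10))
print(sharpened.shape, sharpened.min(), sharpened.max())
```

With these assumed weights, flat regions keep their brightness while the digit strokes stand out more against the slowly varying background.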
B. Image Augmentation: The training data in the provided dataset is cleaner and most of it is
easily comprehensible, but the validation and test data contain some of the most challenging test
cases, meant to evaluate model performance in the noisiest conditions. So we had to artificially
generate or augment our dataset to increase the variation with built-in augmentation and image
preprocessing functions from Keras library initially 0.2, height shift range of 0.2. Later we
improved accuracy by enlarging the main database with more manually generated augmented images
for more variation. Images with salt and pepper noise have been generated using the MATLAB
function imnoise(). Blurred, washed-out images with random angles ranging over
-35,-30,20,10,20,30,40 degrees have been generated using cv2. It applies a normalized box filter
to the image.
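Salt and pepper noise of the kind described here can be sketched in NumPy as follows. The paper used MATLAB's imnoise(); this Python version is only an illustration of the idea, and the noise density of 0.05 is an assumption because the value used in the paper is not given above.

```python
import numpy as np

def salt_and_pepper(image, density=0.05, rng=None):
    """Set a random fraction `density` of pixels to black (pepper) or white (salt)."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = image.copy()
    u = rng.random(image.shape)
    noisy[u < density / 2] = 0          # pepper: darkest value
    noisy[u > 1 - density / 2] = 255    # salt: brightest value for 8-bit images
    return noisy

# Hypothetical mid-gray 64x64 image.
image = np.full((64, 64), 128, dtype=np.uint8)
noisy = salt_and_pepper(image, density=0.05, rng=np.random.default_rng(0))
print("changed pixels:", int((noisy != image).sum()))
```

Adding such corrupted copies to the training set exposes the model to the kind of noise that appears in the harder test images.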
C. Proposed model for classification: In this method we've experimented with different CNN
models and taken our two best performing models (Model A and Model B) for ensembling.
In both models, the first convolutional layers consist of 32 filters with 5x5 kernel size, which
generally extract low level features like vertical and horizontal edges to a greater extent,
followed by second layers of 32 filters with 3x3 kernel size. After that, Maxpool layers with 2x2
kernel size and strides of 2 are employed to reduce the features by taking the maximum value,
which greatly cuts computation and overfitting. Similarly, two convolutional layers of 64 filters
each with 3x3 kernel size are added, followed by Maxpool layers configured like the previous
Maxpool layers. Experimenting with different configurations, we eventually found that using a
slightly wider convolutional layer at the end of Model A offers a slight accuracy boost.
The Rectified Linear Unit (ReLU) activation function is used in every layer, including the fully
connected (FC) layers, except the final FC layers in both models. The convolution feature maps
are flattened and connected to an FC layer with 64 neurons. Dropout layers are added before the
final FC layers with value of 0. Finally, FC layers of 10 neurons with the Softmax
activation function are added for classification of the 10 classes, and the models are ensembled
by averaging the final output layers. Same padding is used in all convolutional layers in both
models.
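Ensembling by averaging the final output layers can be sketched as below. The two probability vectors stand in for the Softmax outputs of Model A and Model B; the numbers are made up for illustration, and the function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(prob_a, prob_b):
    """Average two models' class probabilities and pick the most likely class."""
    avg = (prob_a + prob_b) / 2.0
    return avg, avg.argmax(axis=-1)

# Made-up outputs for one image over 10 digit classes.
prob_a = np.array([[0.60, 0.40, 0, 0, 0, 0, 0, 0, 0, 0]])
prob_b = np.array([[0.10, 0.90, 0, 0, 0, 0, 0, 0, 0, 0]])
avg, predicted = ensemble_predict(prob_a, prob_b)
print(predicted)
```

In this made-up case Model A alone would choose class 0, but the averaged output chooses class 1 because Model B is much more confident, which is how averaging lets one model correct the other.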
Result: In this training, 20 percent of 116395 specimens have been used as the validation set for
determining model performance and the other 96116 specimens were used for training. We've also
compared the result with training on the original non-augmented data of 72000+ specimens, to
compare our proposed method with other non-augmented machine learning and feature extraction
based approaches. The model has been implemented with python library
namely Keras v2.4. 2 python library and MATLAB image processing toolbox from MATLAB
r2017a have been used for manual augmentation. 50 GHz, RAM: 8.00 GB, Graphics: NVIDIA
GT-940MX, 2GB) and Google Colaboratory [24] Cloud platform with NVIDIA Tesla K-80 GPU
and 12GB RAM support. We've tested with various numbers of iterations and found that a minimum
of 30 and 6 epochs are needed to get the maximum performance in Model A and Model B respectively.
We conclude that our proposed method performs worst in detecting numeral '১', which is
misclassified in 57 cases among 1750 specimens. That means the lowest accuracy, 96.74%. After
that, numeral '৯' has the second lowest detection rate with 97.360% accuracy. Our proposed model
confuses these two numerals and misclassified '১' as '৯' 26 times and '৯' as '১' 25 times, because
numerals '১' and '৯' can sometimes be a bit confusing even to human eyes, depending on the test
case. Numeral '৪' has the highest detection rate, misclassified in only 25 specimens among 1774
test cases (98.59% accuracy). Some of the test specimens are very confusing even for human
eyes. Among the 17760 specimens only 570 test cases are misclassified. Most of the misclassified
specimens are heavily augmented, noisy data. The model has 96.788% accuracy.
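The accuracy figures quoted above follow directly from the misclassification counts, as this short check shows. Only the counts stated in the text are used; the function name is hypothetical.

```python
def accuracy_percent(misclassified, total):
    """Accuracy in percent from the number of misclassified specimens."""
    return 100.0 * (1.0 - misclassified / total)

print(round(accuracy_percent(57, 1750), 2))    # numeral '১': 96.74
print(round(accuracy_percent(25, 1774), 2))    # numeral '৪': 98.59
print(round(accuracy_percent(570, 17760), 2))  # all test specimens: 96.79
```

The first two values match the per-digit accuracies reported, and the overall value agrees with the reported 96.788% to within rounding.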
Limitation: 98.98% accuracy with image augmentation though he had test images which were not
so noisy. Our proposed model outperforms the previous works on clear images, where we have
achieved 99.2, and also gives very good accuracy, beyond 90%, on noisy, highly augmented specimens.
Before augmenting, we can see that the accuracy was very low for tilted, random box noise and
color shifted specimens. For tilted images the accuracy has been only 13. After augmentation,
accuracy has jumped to 95. As shown in Table VI, our proposed model outperforms some well
established models like ResNet-18 and LeNet-5. We also compared our model with another ensemble
technique, in which we've trained Model A with 5-fold cross validation and obtained 5 different
models with the same architecture, but our proposed model outperforms it as well.
Result and Analysis
Compare the results of all four methods and write your analysis.
3. Conclusion
References
[1] Khan, Haider Adnan, Abdullah Al Helal, and Khawza I. Ahmed. "Handwritten bangla
digit recognition using sparse representation classifier." 2014 International Conference on
Informatics, Electronics & Vision (ICIEV). IEEE, 2014.
[2] Shuvo, Shifat Nayme, et al. "MathNET: using CNN bangla handwritten digit,
mathematical symbols, and trigonometric function recognition." Soft Computing Techniques
and Applications: Proceeding of the International Conference on Computing and
Communication (IC3 2020). Springer Singapore, 2021.
[3] Rehana, Hasin. "Bangla handwritten digit classification and recognition using SVM
algorithm with HOG features." 2017 3rd International Conference on Electrical Information
and Communication Technology (EICT). IEEE, 2017.
[4] Noor, Rouhan, Kazi Mejbaul Islam, and Md Jakaria Rahimi. "Handwritten bangla
numeral recognition using ensembling of convolutional neural network." 2018 21st
international conference of computer and information technology (ICCIT). IEEE, 2018.
