2. ABSTRACT
For humans, identifying numbers or items in a picture is extremely simple, but how
do you train the machine to recognize these different things in images? Convolutional
Neural Networks (CNN) can solve this problem. In this report, a convolutional neural
network has been trained by identifying pictures in MNIST handwritten digital database
to predict exactly what the numbers in the picture are. Obviously, human beings can
perceive that there is a hierarchy or conceptual structure in the image, but the machine
does not, for example the trained neural network is inconvenient to deal with special
changes in a position of numbers in digital pictures. Exactly put, no matter what the
environment of the image (image background) is, it is unchallenging for human beings to
judge whether there is such a figure in the image and it is unnecessary to repeat the
learning training.
3. 1. INTRODUCTION
In MNIST dataset, the data is already well prepared: the images were centered in a 28x28
image by computing the center of mass of the pixels, and translating the image so as to
position this point at the center of the 28x28 field. The training set is composed of 30,000
patterns from SD-3 and 30,000 patterns from SD-1. And test set was composed of 5,000
patterns from SD-3 and 5,000 patterns from SD-1. Because the dataset is too huge, a
better idea is beginning with an available subset, which could be found on Kaggle
webpage. Although the dataset has been labeled, it’s still my task to design proper
method to preprocessing the data including read binary files with proper programming
language and choose acceptable data structure to avoid overflow. Moreover, a series of
machine learning algorithms were applied to the training set and validated with the test
set, and the models were tuned until the performance and accuracy is relatively
acceptable. Based on my experience on Tensorflow and Keras , I developed this
recognizer with python language. And in order to enhance my engineering skills and
experience on tuning real-world systems instead of toy models, I focused on some details
of the models and implement part of them on my own, but I didn’t reinvent everything.
1.1 THE MNIST DATABASE
The MNIST database of handwritten digits contains 60,000 training examples and 10,000
testing examples, which are 28 * 28 images. All of digits have already been size
normalized and preprocessed and formatted (LeCun et al., 1998). The four files provided
on the website are used in the training and testing for neural networks. In the process of
loading the data set can be directly called from the MNIST database, but due to the
requirements of the assignment, I downloaded these image files from the website that
provides the data set. Because downloading browsers may unzip these image collection
files without your attention, this operation may cause the downloaded files to be larger
than previously mentioned. Thus, if you need to see some problems with the original
image set or data set, you can view the original site of the data set via the link provided
in the reference section of the paper. Due to the use of Python's own data set, simplifying
the section on data preprocessing in the code. The images are all centered in 28 * 28
field.
4. 2. APPROACH
Convolutional neural networks are more complex than standard multi-layer perceptrons,
so we will start by using a simple structure to begin with that uses all of the elements for
state of the art results. Below summarizes the network architecture.
2.1 ConvolutionalNeuralNetworks
Due to the selection of the data set, we decompose the picture into 28*28 blocks of the
same size. According to the original trained neural network, we input a complete picture
into the neural network. But for CNN, the pixel block is directly input this time. The same
neural network weight will be used for every small tile. If any small tile has any
abnormality, we think the tile is interested. In this neural network, there is no order in
which small tiles are disturbed, and the results are still saved in the order of input. Then
we will get a sequence. The part where the picture is stored is interesting. Since the array
is generally large, we will first down sample it to reduce the size of the array. Find the max
value in each grid square in our array. Finally, the column will be inputted into the Fully
Connected Network and the neural network will determine if the picture matches.
5. 2.2 ARCHITECTURE OF CONVOLUTIONAL NEURAL NETWORK
Two layers will be convolution layers the first with 64 channels, a 3 x 3 kernel and Rectifier
Linear Unit (ReLu) function which will feed 64 images into the second layer, while the
second layer will have 32 channels, a 3 x 3 kernel and Rectifier Linear Unit (ReLu)
function and feed 32 images into the third layer. The third layer is the flatten layer to
transform the dimentionality of the image to a 1-Dimension array to connect with the last
layer which contains 10 neurons and the activation function softmax.
6. 3. RESULT
3.1 PERFORMANCE
3.2 TRAIN MODEL
Train the model on the training data set ( X_train and y_train). I will iterate 3 times over
the entire data set to train on, with a number of 32 samples per gradient update for
training. Then store this trained model into the variable hist. I did not specify the number
of samples (batch), by default if the batch isn’t specified, then it is 32.
Batch: Total number of training samples present per gradient update.
Epoch:The number of iterations when an ENTIRE dataset is passed forward and
backward through the neural network only ONCE.
Fit: Another word for train
7. 3.3 TEST MODEL
The model returns only probabilities. So let’s show the probabilities of the first 4 images
in the test set. The probabilities are pretty hard to read. To understand them you must
count find the highest number in the set and then count the index that the number is to
figure out what the label is which is the index number. For example in the image above for
the 3rd image, the highest probability is 9.98755455e-01which means 99.8755% and that
number is located at index 1, so the label is 1. So let’s print the predictions as labels for
the first 4 imagesinstead of probabilitieslike above, and let’s print the actual values / labels
of each image to see how they match up.
8. 4. CONCLUSION& FUTURE SCOPE
The digits patterns used in this training are all black and white images, so this model
may be difficult to deal with common color patterns. The training set of color images
may lead to an increase in the degree of ambiguity in the image, which in turn affects
the accuracy of the prediction. Moreover, in this paper, I only studied the pictures in the
MNIST data set. In actual problems, the specification and definition of the pictures are
undoubtedly important factors affecting the accuracy of the model. Therefore, how to
preprocess is very important, but in order to simplify the research process, I omitted this
step. And in real life, most of the pictures we touch are very complicated and colorful.
Pictures are often background. The digits that you want to train and learn may only
occupy a small part of the complete picture. How to deal with the size of these pictures
and find out which part of the model you want to identify is also a problem. In addition,
the standard item pattern is difficult to be standardized. In the paper we only discuss
simple numbers, but if we need to identify images of some animals, we need to consider
a three-dimensional object from different angles. The difference between the flat images
shown. In the future research, the neural network model also needs to be considered in
combination with actual conditions. We can further study to improve the accuracy of
CNN by improving the issues I mentioned above.