classifier

Using classifiers efficiently in computer vision

Mutawaqqil Billah
Independent Researcher,
B.Sc in Computer Science and Mathematics,
Ramapo College of New Jersey, USA
Address : 906/2, East Shewrapara, Mirpur,
Dhaka, Bangladesh
Phone: 8801912479175
Email : mutawaqqil02@yahoo.com

All the classifiers work with short sequence of data with little variations. But, in reality, we could have a
large amount of data from where we have to find the patterns to recognize it. For example, finger print,
face and palm recognition and other recognition related issues, in all of these we have large amount of
patterns which need to be recognized. In shape recognition, we can use silhouette which will reduce the
number of feature or gradient points as we will use the outer contour only. Sometimes we use gradient
points to reduce interested points in an image. Even after that we have a large amount of data which we
cannot put in a single classifier easily. We need to break it down into small parts.
In the following section we will discuss different methods by which we can use classifiers for pattern
recognition.
Method one:
In this method, we will use single layer. We will break down the image or data. We can take pixel data or
gradient points only or any other feature points into consideration. We will take each column or row or
diagonal line for each classifier.  We will use these data for classifiers. For example, we will take first
column of every image in training set and collect that data and train the classifier with that data. This
classifier’s responsibility will be just to recognize the first column of any test image. For each column, we
will have one classifier to recognize it.
Method two:
In this method, we will use single layer BCOM. BCOM means block combination.  Here we will use a
block of an image as data for a specific classifier. For example, we will divide the image by 8 by 8 blocks,
all coming side by side one after another. We can also use inner blocks and outer blocks. Inner blocks
are simple small block of an image and outer blocks consist of inner blocks. For inner block, we can take
adjacent data, like 4 by 4 block or collect sixteen points randomly from image and for outer block, we
can choose inner blocks randomly or use adjacent data. Each outer block’s data will be used in a
separate classifier. Each classifier will be trained with same block in all training images and it will be
responsible to recognize the same block of the test image.
We can get the block data from the points inside the block or randomly from image. For example, in an 8
by 8 block, we will have 64 data. We can get the data from adjacent rows and columns. For example, if
we take the first block, then we need to take first row’s 8 values, then second row’s 8 values and
similarly up to eighth row. But, another way is, we will take 64 data in each block but the data will come
randomly from the image and each data will be used only in one block. For example, we want to divide a
1024 by 1024 image by 16 by 16 blocks. Then we will have 4096 blocks. We will fill the first block by
collecting data randomly from the image. Then, second block will use the remaining data for selection
and randomly get the data and similarly all the blocks will get their own data. We can always increase
block size to reduce number of blocks so that we have less classifiers to recognize. We will use classifiers
for each block of data. Classifiers could be one from the list consist of Neural net, HMM, SVM and K
nearest neighbor. One classifier will work with one block’s data in all training set and its responsibility is
to recognize that block or patterns for that block.

This breaking down of image is itself can act as a classifier. We do not need to use HMM or neural net
for them. Each image will have different information on each block. We will use genetic algorithms on it.
Meaning, we will take different combinations and weights, and train the data and test those and take
only the 10% of best performers and produce remaining data of the new population from these 10%.
We will keep doing this until we have a satisfactory result. We will eventually move towards the patterns
in the image.

Method three:
In this method we will discuss single layer with VBCOM. VBCOM means variable length block
combination. It is same as BCOM but the only difference here we have is each block will have different
size. For example, one block will be 16 by 16, another could be 32 by 32, another one 64 by 64 or any
other size. Using different size will be more powerful as it will dynamically find the patterns. We do not
know what is the size of any particular pattern. We do not know what is the area where a specific
pattern lays. It will be very dynamic and frequently changing in natural or real time situation. That is why
VBCOM will produce more accurate result. Because, as we will do GA on VBCOM, it will slowly move
towards the structure of the pattern. In GA, any structure close to the real pattern will produce better
result than others. In every generation in GA, we will move closer to real patterns in any type of dataset
including computer vision and speech recognition. In BCOM, we are restricting the size by assuming all
patterns in the data will occupy same size which will be wrong for most of the real time data. But, in
VBCOM, we have an option to have pattern of any size and GA will help us to find the real patterns. It is
like a blind person finding the best route to reach somewhere by himself. It is a natural way to find logic
or patterns inside a given data set. We will use genetic algorithms on it. Meaning, we will take different
combinations and train the data and test those and take the best 10% and produce remaining data from
these 10%. We will keep doing that until we have a satisfactory result. We will eventually move towards
the patterns in the image.

Method four:
In this method, we will discuss about two layers, first layer for BCOM and second layer for least square
for classifiers. That means genetic algorithm will be used in the first layer and least square method will
be used in the second one. In first layer, BCOM will be used to get the best block combination of the
training data. In the second layer, least square method will be used with one from these classifiers,
HMM, neural net, SVM and K nearest neighbor. When one combination will be chosen in first layer,
second layer will use least square method using that BCOM to get the best result out of some
predefined number of different choices in least square method. In second layer, we will use different
structure for NN or HMM or SVM or K nearest neighbor. If we choose NN, then we will try different
input layer size, hidden layer size and number of hidden layers. First layer will use GA and keep 10% of
best performer and produce children to fill the population and run the process again with new
population. Finally, it will get the best BCOM along with best HMM or neural net or SVM or K nearest

neighbor. When we are done with processing the data, then we will have the best BCOM from first layer
and also the best structure for classifiers with best weights from second layer. For any BCOM, we will
have best classifier structure with weights as we will get that after trying many of those and choosing
the best performer.

Method 5:
In this method, we will discuss two layers, BCOM in first layer and neural net in second layer. That
means two layers of genetic algorithm will be used. In first layer, BCOM will be used to get the best
block combination for the training data. In the second layer, GA will be used to get the best neural net
for a particular BCOM. When one combination will be chosen in first layer, second layer will use that
BCOM to get the best neural net for that BCOM. First layer will use GA and keep 10% of best performer
and produce children to fill the population and run the process again with new population.  Whereas, on
receiving the BCOM, second layer will use GA to get best NN. It will keep 10% of it and produce the
remaining of the population from this best 10%. And run the process again with new population. So,
each BCOM will get a run of GA in second layer. Finally, we will get the best BCOM along with best
neural net.

Method 6:
In this method we will discuss about two layers, BCOM in the first layer and HMM in the second layer.
That means two layers of genetic algorithm will be used. In first layer, BCOM will be used to get the best
block combination for the training data. In the second layer, GA will be used to get the best HMM for
that BCOM. When one combination will be chosen in first layer, second layer will use that BCOM to get
the best HMM for that BCOM. First layer will use GA and create new population from best performer of
previous generation. Whereas, on receiving the BCOM, second layer will use GA to get best HMM. It will
keep 10% of it and produce the remaining of the population from this 10%. And run the process again.
So, each BCOM will get a run of GA in second layer. Finally, it will get the best BCOM along with best
HMM.

Method 7:
In this method we will discuss about two layers, BCOM in the first layer and SVM in the second layer.
That means two layers of genetic algorithm will be used. In first layer, BCOM will be used to get the best
block combination of the training data. In the second layer, GA will be used to get the best SVM for that
BCOM. When one combination will be chosen in first layer, second layer will use that BCOM to get the
best SVM for that BCOM. First layer will use GA and create new population from best performer of
previous generation.  Whereas, on receiving the BCOM, second layer will use GA to get best SVM. It will
keep 10% of it and produce the remaining of the population from this 10%. And run the process again

with new population. So, each BCOM will get a run of GA in second layer. Finally, it will get the best
BCOM along with best SVM.

Method 8:
In this method we will discuss about two layers, BCOM in the first layer and K nearest neighbor in the
second layer. That means two layers of genetic algorithm will be used. In first layer, BCOM will be used
to get the best block combination of the training data. In the second layer, GA will be used to get the
best K nearest neighbor for that BCOM. When one combination will be chosen in first layer, second layer
will use that BCOM to get the best K nearest neighbor for that BCOM. First layer will use GA and keep
10% of best performer and produce children to fill the population and run the process again with all
combinations in new population. Whereas, on receiving the BCOM, second layer will use GA to get best
K nearest neighbor. It will create new population from best performer of previous generation. So, each
BCOM will get a run of GA in second layer. Finally it will get the best BCOM along with best K nearest
neighbor.
For each BCOM, we will use one k nearest neighbor classifier. Once one block’s data will be given to
classifier, we can insert the data directly in the tree or use some weights. For example, for level one, we
know which data belongs to level one. We can use clustering to find out in which cluster it belongs.
While processing the clustering, we can multiply each data in a level with some weights, then find the
total for that and process clustering. We can try different weights for each level and see which one gives
better clustering of the data or works well with training data.

Method 9:
In this method we will discuss about two layers, VBCOM in the first layer and least square method with
one classifier from a list consists of NN, HMM, SVM and K nearest neighbor in the second layer. That
means genetic algorithm will be used in the first layer and least square method in the second. In first
layer, VBCOM will be used to get the best block combination of the training data. In the second layer, GA
will be used to get the best classifier for that VBCOM, whichever is used from a list consists of NN, HMM,
K nearest neighbor and SVM. When one combination will be chosen in first layer, second layer will use
that VBCOM to get the best classifier for that VBCOM. First layer will use GA to create new population
from best performer of previous generation. Whereas, on receiving the VBCOM, second layer will use
least square method to get best classifier. So, each VBCOM will get a run of least square method in
second layer. Finally, it will get the best VBCOM along with best classifier.

Method 10:

In this method we will discuss about two layers, VBCOM in the first layer and neural net in the second
layer. That means two layers of genetic algorithm will be used. In first layer, VBCOM will be used to get
the best block combination of the training data. And in the second layer, GA will be used to get the best
neural net for that VBCOM. When one combination will be chosen in first layer, second layer will use
that VBCOM to get the best neural net for that VBCOM. First layer will use GA and keep 10% of best
performer and produce children to fill the population and run the process again with all combinations in
new population. Whereas, on receiving the VBCOM, second layer will use GA to get best NN create new
population from best performer of previous generation. So, each BCOM will get a run of GA in second
layer. Finally, it will get the best VBCOM along with best neural net.

Method 11:
In this method we will discuss about two layers, VBCOM in the first layer and HMM in the second layer.
In first layer, VBCOM will be used to get the best block combination of the training data. In the second
layer, GA will be used to get the best HMM for that VBCOM. When one combination will be chosen in
first layer, second layer will use that VBCOM to get the best HMM for that VBCOM. First layer will use
GA and create new population from best performer of previous generation. Whereas, on receiving the
VBCOM, second layer will use GA to get best HMM. It will keep 10% of it and produce the remaining of
the population from this 10%. And run the process again with new population. So, each VBCOM will get
a run of GA in second layer. Finally, it will get the best VBCOM along with best HMM.
Method 12:
In this method we will discuss about two layers, VBCOM in the first layer and SVM in the second layer. In
first layer, VBCOM will be used to get the best block combination of the training data. In the second
layer, GA will be used to get the best SVM for that VBCOM. When one combination will be chosen in
first layer, second layer will use that VBCOM to get the best SVM for that VBCOM. First layer will use GA
and create new population from best performer of previous generation. Whereas, on receiving the
VBCOM, second layer will use GA to get best SVM. It will create new population from best performer of
previous generation and run the process again with new population. So, each VBCOM will get a run of
GA in second layer. Finally, it will get the best VBCOM along with best SVM.

Method 13:
In this method we will discuss about two layers, VBCOM in the first layer and K nearest neighbor in the
second layer. In first layer, VBCOM will be used to get the best block combination of the training data. In
the second layer, GA will be used to get the best SVM for that VBCOM. When one combination will be
chosen in first layer, second layer will use that VBCOM to get the best SVM for that VBCOM. First layer
will use GA and create new population from best performer of previous generation and run the process
again with all combinations in new population. Whereas, on receiving the VBCOM, second layer will use

GA to get best SVM. It will create new population from best performer of previous generation and run
the process again with new population. So, each VBCOM will get a run of GA in second layer. Finally, it
will get the best VBCOM along with best SVM.

Method 13:
In this method we will discuss about two layers, least square method in the first layer with classifier in
the second layer, one from a list consists of NN, HMM, SVM and K nearest neighbor. That means, two
layers will be used. In first layer, least square method will be used. We can use anyone of NN, HMM,
KNN and SVM. If we chose NN, then we will create many NN structures where we will specify input layer
size, output layer size, hidden layer size and other information.  Similarly, if we choose HMM, we will
create many HMM structures along with many state numbers, sequence length, overlap and other
issues. And if we decide to go with SVM, then we will create many SVM structures with many weight
policies and clustering. If we choose K nearest neighbor, then we will create different structure of tree
with different combination of features or blocks. For each one of these, we will use many BCOM with GA
in second layer. We can also use least square method in second layer instead of GA. We will have to try
many BCOM and pick the one which performs best if we want to go with least square method. In the
second layer, GA will be used to get the best BCOM. When one combination structure of classifier is
chosen, second layer will get the best BCOM for that structure.

Method 14:
In this method we will discuss about two layers, neural net in the first layer and BCOM in the second
layer. In first layer, NN will be used to get the best structured NN for the training data. We do not know
which structure of the NN will be best performer for the training data. What size of the input layer
works best with the data? How many hidden layer works best and best size of the hidden layer. In the
second layer, GA will be used to get the best BCOM for a specific NN. When one NN structure will be
chosen in first layer, second layer will use that NN to get the best BCOM for that. First layer will use GA
and create new population from best performer of previous generation and run the process again with
new population.  Whereas, on receiving the NN, second layer will use GA to get best BCOM. It will create
new population from best performer of previous generation. And run the process again using new
population. So, each NN will get a run of GA in second layer. Finally, it will get the best BCOM along with
best neural net.
Method 15:
In this method we will discuss about two layers, HMM in the first layer and BCOM in the second layer.
That means, two layers of genetic algorithm will be used. In first layer, HMM will be used to get the best
HMM structure for the training data. We will try different structure for HMM with different sequence
number, different state number and overlap strategy along with other issues. In the second layer, GA

will be used to get the best BCOM. When one HMM structure will be chosen in first layer, second layer
will use that HMM to get the best BCOM. First layer will use GA and keep 10% of best performer and
produce children to fill the population and run the process again with new population.  Whereas, on
receiving the HMM structure, second layer will use GA to get best BCOM. It will keep 10% of it and
produce the remaining of the population from this 10%. And run the process again with new population.
So, each HMM will get a run of GA in second layer. Finally, it will get the best BCOM along with best
HMM.

Method 16:
In this method we will discuss about two layers, SVM in the first layer and BCOM in the second layer.
That means two layers of genetic algorithm will be used. In first layer, SVM will be used to get the best
SVM of the training data. In the second layer, GA will be used to get the best BCOM. When one SVM
structure will be chosen in first layer, second layer will use that SVM to get the best BCOM. First layer
will use GA and keep 10% of best performer and produce children to fill the population and run the
process again with all SVM structures in new population.  Whereas, on receiving the SVM, second layer
will use GA to get best BCOM. It will keep 10% of best performer and produce the remaining of the
population from this 10%. And run the process again with new population. So, each SVM structure will
get a run of GA in second layer. Finally, it will get the best BCOM along with best SVM.

Method 17:
In this method we will discuss about two layers, K nearest neighbor in the first layer and BCOM in the
second layer. That means two layers of genetic algorithm will be used. In first layer, K nearest neighbor
will be used to get the best K nearest neighbor of the training data. In the second layer, GA will be used
to get the best BCOM. When one K nearest neighbor structure will be chosen in first layer, second layer
will use that K nearest neighbor to get the best BCOM. First layer will use GA and keep 10% of best
performer and produce children to fill the population and run the process again with all K nearest
neighbor structures in new population.  Whereas, on receiving the K nearest neighbor, second layer will
use GA to get best BCOM. It will create new population from best performer of previous generation. And
run the process again with new population. So, each K nearest neighbor structure will get a run of GA in
second layer. Finally, it will get the best BCOM along with best K nearest neighbor.

Method 18:
In this method we will discuss about two layers, one classifier from a list consists of NN, HMM, SVM and
KNN in the first layer and VBCOM in the second layer. That means two layers will be used. In first layer
least square method will be used to get best classifier for the training data. In the second layer, GA will
be used to get the best VBCOM. When one structure of classifier will be chosen in first layer, second

layer will use that structure to get the best VBCOM for that. First layer will use GA and keep 10% of best
performer and produce children to fill the population and run the process again with all new structures
in new population. Whereas, on receiving the structure, second layer will use GA to get best VBCOM. It
will create new population from best performer of previous generation. And run the process again with
new population. So, each structure will get a run of GA in second layer. Finally, it will get the best
VBCOM along with best classifier.

Method 19:
In this method we will discuss about two layers, neural network as classifier in the first layer and VBCOM
in the second layer. That means two layers of genetic algorithm will be used. In first layer, neural
network will be used to get the best NN structure of the training data. And in the second layer, GA will
be used to get the best VBCOM for that particular NN. When one NN structure will be chosen in first
layer, second layer will use that structure to get the best VBCOM for that NN. First layer will use GA and
create new population from best performer of previous generation and run the process again with all
structures in new population. Second layer, on receiving the NN structure, will use GA to get best
VBCOM for that. It will create new population from best performer of previous generation and run the
process again with new population. So, each NN will get a run of GA in second layer. Finally, it will get
the best VBCOM along with best neural net.
Method 20:
In this method, we will discuss about two layers, HMM in the first layer and VBCOM in the second layer.
That means two layers of genetic algorithm will be used. In first layer, HMM will be used to get the best
HMM structure of the training data. In the second layer, GA will be used to get the best VBCOM for that
HMM. When one HMM will be chosen in first layer, second layer will use that HMM to get the best
VBCOM for that HMM. First layer will use GA and create new population from best performer of
previous generation and run the process again with all HMM structures in new population. Second
layer, on receiving the HMM, will use GA to get best VBCOM. It will create new population from best
performer of previous generation. And run the process again with new population. So, each HMM will
get a run of GA in second layer. Finally, it will get the best VBCOM along with best HMM.

Method 21:
In this method we will discuss about two layers, SVM on first layer and VBCOM in the second layer. That
means two layers of genetic algorithm will be used. In first layer, SVM will be used to get the best SVM
structure of the training data. In the second layer, GA will be used to get the best VBCOM for that SVM.
When one SVM will be chosen in first layer, second layer will use that SVM to get the best VBCOM for
that SVM. First layer will use GA and create new population from best performer of previous generation
and run the process again with all structures of SVM in new population. Second layer, on receiving the

SVM, will use GA to get best VBCOM. It will create new population from best performer of previous
generation. And run the process again with new population. So, each SVM will get a run of GA in second
layer. Finally, it will get the best VBCOM along with best SVM.

Method 22:
In this method, we will discuss about two layers, K nearest neighbor on first layer and VBCOM in the
second layer. That means two layers of genetic algorithm will be used. In first layer, K nearest neighbor
will be used to get the best K nearest neighbor structure of the training data. In the second layer, GA will
be used to get the best VBCOM for that K nearest neighbor. When one K nearest neighbor will be
chosen in first layer, second layer will use that K nearest neighbor to get the best VBCOM for that K
nearest neighbor. First layer will use GA and create new population from best performer of previous
generation and run the process again with all structures of K nearest neighbor in new population.
Second layer, on receiving the K nearest neighbor, will use GA to get best VBCOM. It will create new
population from best performer of previous generation. And run the process again with new population.
So, each K nearest neighbor will get a run of GA in second layer. Finally, it will get the best VBCOM along
with best K nearest neighbor.

Once an image is given, create inner blocks and outer blocks using VBCOM or BCOM. Whole image will
be divided into many outer blocks and each outer block will have many inner blocks. In tree, we can
keep one outer block in each level. As many images will have the same outer block, we will have some
cluster for these data. Each cluster will be a branch for that level. Each cluster will have cluster center
and radius.
We can also keep couple of outer block in each level. In that case, cluster the data using combined result
on data for all the outer blocks of a level. While searching for a match in the tree, test data will take the
appropriate branches and reach leaf node where it will find the match. For doing two layers GA using
BCOM or VBCOM on top, on receiving each BCOM from first layer, do GA to get the best tree structure
for that BCOM. And for doing two layers GA using BCOM or VBCOM on bottom, different tree structures
need to be created on top layer. When each tree structure will be provided to second layer, it will create
many BCOM for that tree and do GA to find the best BCOM for that tree.

Different techniques for neural net:
We can try new different types of structure for NN. Usually, every node in hidden layer is connected
with every node in input layer. Try this that while training ( fixing weights), input layer is not connected
to every node in hidden layer, but a portion of it and input layer randomly selects its connection in
hidden layer, same for hidden to output. This is good for maze, like where for a given input, we want
different output every time. It could be used in games, or online tests like GRE. Each node in input layer

is randomly connected to a portion of hidden layer node. It will create more variation and will help for
pattern recognition. Because different objects will not have the result unless they are same or close.

In another method, every node in input layer is not connected to every node in hidden layer. It has a list
to which it is connected and we will create that randomly. For example, we have 100 node in input layer
and 100 in hidden, then each node in input is connected to 10 nodes in hidden. Same for hidden to
output. Another idea we could implement is that the list of connection varies randomly. Like, first node
in input is connected to 10 nodes in hidden and second node in input connects to 15 nodes in hidden,
like that. This way, we are creating more variation or patterns.

In this method, in input layer for every node, we have a list of values. Each value is connected to
different nodes in hidden layer and also the number of connection for everyone in the list varies. Like
first one is connected to 10 hidden nodes and second one is connected to 15 different hidden nodes
(gets selected randomly). This way, we do not need to use recursive NN to recognize a shape. We give
the image of the shape at once and it is recognized with one big NN. Each row of the image is given to
each input node with a list. We could use 8 by 8 blocks in the image. The structure will be, for example,
10 nodes in input layer, 100 nodes in 1st hidden layer, 100 nodes in 2ed hidden layer, 50 nodes in 3rd
hidden layer, 25 nodes in 4th hidden layer and 2 nodes in output layer. Each hidden layer node has
different number of connection to next layer. For example, a node in 2ed hidden layer is connected to
10 nodes in 1st hidden layer and it is connected to 5 nodes in 3rd hidden layer.

In this method, for different range of values in input list, it connects to different nodes in hidden. Say,
for values between 10 to 20, it connects to 10 different hidden nodes, and for 20 to 30, it connects to 10
different hidden nodes. So, every input node has a list of hidden nodes to connect and different value
range connects to different hidden nodes from connection list. for example, 1st input node has list of 10
items and it connects to (1,5,9,7,53,62,14,16,25,31) hidden nodes. items valued from 10 to 20 connects
to (1,7,9,14) and items valued from 10 to 20 connects to (53,31,5,25).

For back propagation, get random weights and tune it with back propagation and store the weights.
Again, take random weights and fine tune with back propagation, keep doing this for a pre defined
number of times, like 1000, and take the weight set which gives best result. This way, it will not stick
with local minima. The result will not be global maxima, but still better then single try.

We can also go by column instead of having one NN for each row, we can have one NN for each column.
Going by column might give good results on some scenarios. We can also go by diagonals. Scene is
always rectangular. So, start from one diagonal, then decrease the top left and increase the bottom
right, at the end, top left will be bottom left and bottom right will be top right. Another idea will be to
divide the screen into four parts (or any number of parts). And select random number of values from
each part to gather input for NN. Like, we take 5 values from each of the parts to make it 20. NN input
takes 20. We take the same value from each part for all training set and for testing one. We can take
mixed number of values from each part or randomly take 20 values. We have to remember the locations
of values for each NN. These values will not be the same if the image is same or similar. We take values
randomly for the first training image, for the rest, we will take the values of same location in rest of the
training set. To recognize an image from a training set, we can also select the pixel value or block value

of random 5000 points and see which of the training item has maximum match. We can also do this,
while training, take the values of 5000 random points and save those.  Not only value, we can also add
gradient value and direction. This could be used in preliminary screening. The ones have most match, we
can use more precise test on it, like more random points to match.  Usually they will have a match of
5000 points by value, gradient and direction only if they are same or close.

Instead of random points, we can take the points with high gradient value and angle. Also use mean and
standard deviation. Find 500 of these points, and give 20 to each NN. Calculate the gradient of same
points in remaining train images and train NNs. we can also mix it up with 500 of these points and 500
random points.

Instead of using one NN for each row, we can do this: divide the image into small parts like small blocks,
block size could be 64 by 64. Then in an image of 1024 by 1024, we will have 256 NNs. if we convert the
image to 8 by 8 block, each NN has 64 inputs and two outputs, 1 and 0. It will be very good for shape
recognition or binary classifier. We have to do more to make it object recognizer, like face. The input
face image will have to go through 256 NNs for each face in DB. That is a long calculation. Two things we
can do. First one: use some binary classifiers up front.  It could be SVM or binary NN. Say, at first find out
if it is male or female. It passes through one NN, and divides the DB by two. Again young or old, then
child or young, black or white or others. After passing through the trail, use 256 NNs per remaining
images. We will have very small amount of data left to match after this.

We could also use clustering in some situations. Make as many clusters as possible. Like dimension of
the face, width, height, nose length, ear size, round and tall faces. Use SVM or NN to distinguish
between round and tall faces. Train with 5000 round faces and 5000 tall faces. We can use face
silhouette to do this. Get face height and width from silhouette. This will work on face with different
expression. Because expressions changes small part of the face, say 30 NNs will be disturbed, still we
have 226 NNs left. They will be the same. It will be around 10% off. In case of NN, just have to have
different weights for different lighting condition. If it is in outdoor place, also have to think about fog,
sand and rain. It will be nice to have some program to find the light, fog, sand and rain condition then
just pick the correct weights, and everything else is same. When using a series of binary classifiers, start
from most obvious and easy ones. It is good idea to keep NNs by row, column, diagonal and small parts.
When match will be found by one of them put it to all these NNs of the same image in DB and see if it
matches. Or it could be used by the other way around. When no match found, try other available NNs to
see if any match found.

We can think the object formation by pixels like object formation by chemical atoms in chemistry. Like
hydrogen and oxygen combine to water. Is there any rule of having different pixels on picture of
different object? I mean, object is build by some material, so the chemicals form the object. Chemicals
have rules for bonding, is there any rules for pixels to bond together similarly. 2 hydrogen atom stay
with one oxygen atom and creates water. Similarly, two red stays with one green creates an object or
pattern in image. Are there any patterns for each object image? It has to have a certain amount of H2O
to call it a water body. It has to be certain amount of wood particles to call it a wooden material. Can we

make atomic table for images where instead of different atoms, we will have different colors. Atoms
combine to object, colors combine to object image.

In shape recognition, in the training set, split the image into small blocks like 64 by 64. Find the gradient,
mean and standard deviation of the block. Find out which blocks have close resemblance, group them
together. Create one NN for each group. When test item is given, see in how many blocks, match is
found using NN. If we see a large number of blocks are recognized by NNs, then it is a match. Instead of
using NN, we can also use k means clustering. Create all the blocks of all training images and create
clusters for those as many as possible. Let say, we have 100 test images and we have created 256 blocks
of each image, then we will have 25600 blocks. Say from these blocks, we have found 200 clusters.
When a test image in given, see how far each block from the cluster centers is and determine how many
of them are inside the cluster radius. If a good number of blocks get fit in the clusters, then it is a shape
match. It will work if the shape has some variety in it. It will definitely be more fault tolerant then NN
and powerful.

Instead of clustering blocks randomly, we can cluster the blocks of same position. So, gather the blocks
of same blocks together and cluster them. Find the mean of them and see how far they are from mean.
Accept a certain length distance from mean. Get this distance from data instead of configuration file.
Say, in each cluster 70% data are 5 off the mean. So, tolerance is 5. In case of shape recognition, use
silhouette image or gradient image instead of pixel value. Actually, take all info for the block, gradient
strength, angle, mean and standard deviation. Human wears different dress every day. Get which color
is in the block how many times and find the average color graph (which color, how many times), also get
color by position. Say, 64 by 64 block has 4096 values. Which is the average color for each pixel? 4096 is
a big number, we can also use 4 by 4 block or 8 by 8 block inside 64 by 64 block. It depends on training
data and task. If we use 4 by 4 block, then it has 256 values. Find the average value for each of this and
the color graph. So, compare this with test block, if a closeness found in the values then it is very
probable match. Do not use color info if the shape comes with a varying color. Like human in clothes,
color is different every time of same object.

We can also introduce weight for each cluster. Get some weights for each cluster which will decrease
the standard deviation and bring the values close to the center of cluster. Every block could be a feature
to be used in KNN. Let say, we want to create a NN to recognize one person in different pose and facial
expression. One set of NNs needed for this. We just need his front view, right view and left view. From
these three images, we need to create around 1000 images of him with different pose and facial
expressions. Let say, all these images are 1024 by 1024. So, we will create block with size of 64 by 64.
Gather the same block (same by location) of all images and train a NN with these. So, after training we
will have 256 NNs. For any test image, at first we have to create blocks of it and pass those to
corresponding NN and if enough positive results come from NNs, then we have a match. When it will
recognize, it will recognize the pose and facial expression as well. But, clustering also might work.
Everything is same, just instead of creating NN for each block, create a set of clusters for each block.
With test image, block it and pass each block to corresponding cluster set. Find the common items
which are in the most similar clusters and use shortest square distance to find the match.

Instead of creating blocks of same size like 64 by 64, we can also create blocks with varying size. In the
area where we have found more variations, we can create small  block like 32 by 32 and in area's with
less variations, we can create large block like 128 by 128. Find out the variations by analyzing the
training data.  Start with same size blocks like 64 by 64 and do the clustering. Now analyze the number
of clusters and number of items on it. It could be a large number of clusters or less clusters. If large
number of clusters, then decrease the size, just break the block. And if less cluster, then combine it with
its neighbor (which neighbor is most appropriate). If neighbors are not willing to combine, meaning,
they are in standard form, do not combine it. Or it could be normal number of clusters, but some of the
clusters could have many items on it, break down those blocks. And if there are fewer items, try to
combine it. To use varying size of block is a very good idea. While using clustering, we can use mean and
st. dev of the input values of the block. While clustering one block data, use the mean and st.dev of each
image to create the clusters. Or divide the input numbers by a big number and add the values and then
create clusters. For each spot in the 64 input arrays, include some random weight.  Say, for position 1, it
is 100, for position 2, it is 1000, for position 3, it is 50 and so on. Then multiply the weight with value and
add the values and create clusters.

Instead of one NN per block, we can cluster the block data from all training set. In that case, we will have
one NN per cluster. So, we might have multiple NN per block, at least one. If the block data varies too
much, we have to train a NN to handle all these data. While clustering find how many similar set of data
available for the block. Say, training set has 100 images. So, we have 100 blocks for position 1. From
these 100 block data, for each block, find the mean and st.dev value. Strategy is close ones combine to
one cluster, and they will have one NN (set of weights) for them. As their values have similarity, it will be
easy to create a set of weights which can handle or recognize these values.


If we can separate all the patterns in the image, then use one NN for the pattern. One way to go is save
all the input for each NN, when test input will be given, find it in the saved inputs, if match found then
yes, NN has seen it. Let say the input is 1234567890 in an input layer of 10 values. Save all these values.
It will be very expensive. Say training has 50,000 data. So, each image has 256 blocks. For each block we
have 50,000 inputs stored. (We can reduce it by checking unique value). In worst case, we have to check
12.8 million inputs. But, this will for sure recognize an image if it is in the training. Create something with
these inputs to make it less expensive. Keep them sorted and grouped. Store them in a number tree.
Each level of the tree will have 0‐9 branches. We can store many patterns there. This will make little less
expensive. Like, group them by starting digit. NN has limit on how many variety of patterns it can
handle, but this one can take as many patterns as needed with wide variety. This is good as a recognizer.
But, variation will not work. Our goal here is to create something which will recognize the training data
with slight variation in it, because in real data, we will get variation instead of exact data. What could be
done with those 64 sequential numbers? Do mathematical operations, like arithmetic, create a circle
and save the center and radius. Or create bounding box.  When more precision is needed, use all the
NNs, like NNs by row, by column, by diagonal and by Block. Because, if the image has noise or
deformation in certain area of the image, then the blocks in that area will not match. So, we need to
match with some variety.

In another method, we can get a single block formation info and train and test NN for the block and
return the success rate to get the best weights for a training set, we can use GA. say for a training set of
1000 images create around 10000 set of weights and take the best weight set which gives best results

for training set, use GA to find out best one. Like, use genetic mutation and other concepts of GA of the
weights. After getting best set, we can also use back propagation on it to fine tune it. Keep the best one
of the current population.

To recognize a face with different facial expression, pose, lighting condition and other issues, we need to
create one NN for each face. Each face has a NN of its own. Train the NN with different pose of the face
with different angel and facial expression, lighting condition and others with the same face and train the
NN with all these different kind of images. This is a good idea to recognize a face with different
expressions and poses. Similarly, for different shape recognition, we can create one NN for one shape
with different angel, pose and view, same for other objects. For example, to recognize a bucket, we can
train a NN with different picture of bucket from different angel, view and pose. It will not solve for size,
we have to make different NN for different size. For human organ or real world materials, those do not
get changed in size. While creating NN, input from first node list are combined with some random
element from another input node's list. For example, 1st element of first node in input layer and 5th
element of 12th node of input layer gets connected to 7th hidden node. This is good for exact match
instead of shape like face. This is like little alternative of nearest neighbor.

Another important task is to find out how many clusters are there in the given data and cluster them.
Usually, number of clusters is given before clustering. But, if that number is not given beforehand, then
the strategy is, try to get with some random number for using as number of clusters and then process
clustering on the data. If the data is clustered properly without having clusters with no elements, then
keep increasing the number of clusters until we find any cluster with no element in it. Then the final
number of cluster is total clusters with some elements in it.

Use ransac for initial weights. Try around 1000 or more random initial weights and use the one which
has given correct result more. Then, use back propagation on this weight. In this way, the chance of
getting entangled with local minima is reduced.

Instead of one NN per block, we can cluster the block data from all training set. Then, we can use one
NN per cluster. So, we might have multiple NN per block, at least one. If the block data varies too much,
we need to train a NN to handle all these data. It is better to use one NN per pattern. While clustering,
find how many similar set of data available for the block. Say, training set has 100 images. So, we have
100 blocks for position 1. From these 100 block data, for each block, find the number of gradient points,
their value, angel, direction, mean and st.dev of pixel values. Close data combine to one cluster, and
they will have one NN (set of weights) for them. As their value has similarity, it will be easy to create a
set of weights which can handle or recognize these values. If we can separate all the patterns in the
image, then use one NN for the pattern.

In another method, get a fixed structure and a fixed BCOM or VBCOM info and create basic block nets
and return the success rate. weights got by training back propagation to get the best weights for a
training set, we can use GA. Say for a training set of 1000 images, create around 10000 set of weights
and take the best weight set which gives best results for training set, use GA to find out best one. Like,
use genetic mutation and other concepts of GA for the weights. After getting best set, we can also use

back propagation on it to fine tune it. Keep the best one of the current population. We can replace the
whole population with children or replace a portion of it. Different one will be useful for different
applications. Replace half of the population with children from couple of best parents and take new
random weights as another half of the population. That’s how we are not stuck with local minima as
new random weights are coming to population.

We can also take the best function to calculate the input weight and output weight. Usually, people use
a predefined function to do that. They try one function after others to see which one fits best for the
training data. We could use GA to get the best function for the given data. Genetic algorithm could be
used to find an object. Say, each object has an NN. Given an image, we need to find out which object it
is. Instead of looking every object one by one, we can use genetic algorithm to find it. Say, we choose
some objects from all objects NNs in DB and apply genetic algorithm to those. If not found keep taking
objects randomly from population and discarding failed ones. And while checking an NN, we can use
partial checking, like randomly check some of its layers, if significant match found, stays otherwise
discard. This is good for search in big space. Instead of searching the whole DB, it is good idea to search
in a random collection from DB. There is a good chance that the desired data could be found in the
collection. If not, we will collect some more and search again in new collection. In worst case, we will
find it in last collection. But, as we will do random selection for collection, there is a good chance to get
the data in early collections. Say, we have one million data to search for an object to match. Select five
thousand random data from this big data and search in it. If not found, take another five thousand
random data from remaining data and search again.

Different techniques for HMM:
For HMM, at First we need a state sequence. HMM deals with sequence. It is very difficult to deal with
every number. That is why we can use a range of numbers as a separate state. For example, if we have
only gradient points and normal points in the block, we can introduce two states, gradient or normal.
We can consider values from 0 to 124 as normal and 125 to 255 as gradient point. It is like fuzzy logic.
We do not have to work with exact value, we will use approximate values. We can introduce as many
states as we want. And give each state same priority. For example, we can create state for a range of 10.
Then, 1‐10 will be state number 1, 10‐20 will be state 2 and so on. We can also introduce overlap. That
means, we will give some state more preference than others. For example, 0‐30 will be state 1, 30‐40
will be state 2. Here state 1 is getting bigger range, getting preference more than state 2. We can do
that based on the training data we get.

So, block data will be some array of double or integer. First we have to convert it to state sequence. We
have to convert each number to a state. Then, we will get a sequence of state using the block
information. After that, we have to find the state probability. Meaning, what is the probability of going
to state 2 from state 1. In the training data, find out how many times it goes from state 1 to state 2. That
will be the state probability. We will have to use location information as well. That means, what is the
probability of going to state 2 from state 1 while current position number is 1. It is a sequence, so we
have to consider the position, emission probability. We will also need the probability of any particular
number in specific position of the sequence. For example, what is the probability of getting 100 in
position two? Then, when we will give the test sequence, it can tell us if it has seen this pattern or not. If

it is in the training data, its probability will be very high; otherwise we will get a very low probability,
which will indicate that it has not seen the data before.

We can also create a tree with the state sequence. Tree will help us to find the exact seen pattern. In the
tree, each level will have branches same as state number and levels same as sequence length. But, the
problem will be with slightly different patterns. It will not find the slightly different patterns, tree will
take to wrong path where it will not find any leaf node. We are using a range of data for each state, so
that will give us little variation to handle little off data. For example, 0‐10 is state one. So, if the value is
5 or 9, it will still be in state one. Real‐time data will be little off than the ideal data. If we use mismatch,
we can try to reach leaf node by using alternative route if necessary and mark the mismatch level. Say,
from ten levels, in five levels, we have found match, then it is 50% match. We can try other child nodes
or other nodes from parent node to reach leaf node if no match found in a level.

We can even create a tree to store all the patterns without state sequence. Say, we have given a pattern
of length 100. Then, we will create a tree of 100 levels and insert that pattern in the tree. When we
want to insert the next pattern, we will see if we have the first value as a node in first level, if available,
we do not need to create a node as it is already there. But, if it is not available, then we will have to add
a new node in first level. Then, we will move to second level and see if the second value in the given
pattern available in the second level as a node. If available, we do not need to create it, otherwise we
will have to add a node. In this way, we can populate as many patterns as we want and use it to
recognize patterns. Each level will have all unique numbers as node. Say, we are storing 1000 patterns
and we have 30 different numbers as first node in the pattern, then, we will have 30 nodes for first level.
This will be useful for exact match. We can also create a tree by considering each digit. So, in this tree,
every level will have branches from 0‐9. When we will get the sequence, we need to place them all side
by side, one after another to create a digit sequence. That will give us digit sequence and we can
populate that in the tree. This is not a bad idea or way to store patterns and search for match. We have
to think about creating the tree using RAM, otherwise, it would be very handy tool. If we need some
leeway while matching, then using state sequence will be a good idea.

We can also use the conversional way of training data using Viterbi or Beum Welch or expectation
maximization training procedure. In conventional way, state has no relation with real value. We can
introduce as many states as we want and we have to start training with random state probability.
Previously, we were saying 1‐100 is state number 1 and so on, but here state will not have relation with
data value. It is just randomly created with random initial probability which gets corrected through
training or we need to try to get the best random probability which works best with the training data.
While using viterbi for training, we will start training with random state transition and emission
probability, but those will get corrected through training. If we got more close to real probability, then it
will not get corrected much. We can try many of those and take the one which goes more with the
training data.

For Beum Welch or EM method, we will also try with random probability, but if the initial probabilities
are close to real ones, they will not change much. We can try many of those to see which one really
close to real data and take the one which goes with data more. HMM structure has two parts. Length of

hmm string and number of states. So, randomly choose which one to keep and which one to change
while creating children using genetic algorithm. Say, we want to keep string length unchanged, then
change the number of states randomly.  Instead of pre defining the states, we can use clustering to get
state sequence. Each number in the sequence will be handled by a level in the sequence tree. For each
level, we will use clustering and find out how many clusters are available for that level’s data. That will
be state number for that level and that level will have same number of branches as cluster number. We
will do that for each level. When we need to insert data to state sequence tree, at first we will examine
the given data and find out the cluster where it belongs, that will be its state number. For example, first
value in the sequence is 50. We will find out this value fits in which cluster in first level. This will give us
the state number for first data. We will do the same for the remaining data to get state sequence. Say,
we got a value of block data, to know to which state it belongs, we need to calculate and find out the
closest cluster. Sometime, we need to process clustering for data inside the block if we have large
training data set. If a block’s data has clusters in it, then each cluster will have a HMM classifier for itself.
Try using column, diagonal, small sequential parts, randomly selected parts (BCOM or VBCOM) for each
HMM instead of just row based. Random selection of points and randomly mixing with high gradient
points is also a good idea. It will give good statistical result. So if we take 64 by 64 block, and gave 8 by 8
block before blocking it, then we have 256 values for each HMM. It will receive a string of 256 value and
train the HMM with these values.

Different techniques for SVM :
First, we have to get the block information. Then, we will create random weight sets for each block
combination. For example, we are dealing with block 1. Then, we will get all the block one data from
each training data and examine those. Now, we will try many weight sets for this block to find out which
weight set works best with the given data. Each block member, which usually a number calculated from
a inner block using its values, mean, standard deviation, number of gradient points and their values, will
get a weight; that is why we will need a set of weights for each outer block. Then, we will find out the
range for each weight set. Range means the difference between the highest value and the lowest value.
When we will use weight set on each block data and we add those values, we will get a total, that will be
its value. We can also use many other mathematical formula here as well instead of simply adding the
values of inner blocks to get the total. Then, we will find out what is the range for each weight set. Using
the block information of training data, we need to find out what is the highest and lowest value for each
weight set and calculate the range by getting difference between them. We will take the weight set
which gives lowest range for a block. That means, this weight set works better than others with this
block’s data.

If we have large training data, we can use clustering. We will cluster the block data and we will have one
weight set for each cluster. For example, one block has ten clusters in it. It is easy to get the clustering.
We can use k means clustering here. For simple version, we can add the block data for each image and
get the total. Use k means clustering to get how many clusters we have with these totals. So, when a
test block data will be provided, at first we have to find out in which cluster it belongs. Then use the

proper weight set for that test data. If the test data stays within the range that means it is similar with
training data, otherwise not.
It trains recursively with all rows sequentially. We have one SVM for each row. For each row data train
some weights using positive and negative training data. Use different type of blocks, like by row, by
column, by diagonal, by small parts (BCOM, VBCOM). The weights will give positive result on positive
training data and negative result on negative training data. Training data must have to have both
positive examples and negative examples. For example, to recognize hand, training data should have
images of hand of different people with different size, color, male and female hands. Similarly, we also
need some images which are not hand to train the weights. Weights will help the result to become
positive for positive ones and negative for negative ones. We can use 8 by 8 block. So create 8 by 8 block
in the image. After that, each row is trained with different weights.

We can also do this: get some weights, like one for each block, for negative one and for positive one.
The weights for positive one should give a value with some range. Like it will be between 5 to 15, same
for negative. So, when test image will be given, try the positive weights and see the value. Then try the
negative weights and examine the value. If the value fall in positive range, then positive image and vice
versa. Use clustering if it is hard to find single weight set for a block. From training images, create as
many clusters as possible and create weights for each cluster. When testing, try all cluster weights for
positive to recognize it as positive and do same for negative.

Given two sets of images for binary classification, find the points where we get two different results or
distinguish result using radial basis function or sigmiodal function (kernel function). It means, at this
point, training images give two different results. It means, at this point, two image sets differs. Binary
classifier can recognize objects by using these points. These points are called support vector. Find all
support vectors in the train images, this is for binary classifier, but for class classifier, like recognize
character, find the points where all the values of the training images stay within a limit. It will be difficult
to go to every pixel, create block and also weights. Multiply block data with weights, then send it to
kernal function. Find the blocks which are support vector. Use random block with genetic algorithm
to create children, get the best 10 % of last population, create 90% new children. Randomly choose a
combination from the best 10%, and then randomly select how much of the combination to keep
unchanged and how much to change. Then randomly get combination for the changing part. Say, we got
10 combinations as best ones and we have to create 90. Each time, select a data randomly from 10 data
and randomly select how much to keep. Say, it came out 50%. So keep 50% of the selected combination,
and change the other 50%. Say, a combination has 256 blocks creation info. So, keep the first 128 blocks
creation info unchanged. And for the remaining 128, randomly get the block creation info from the
remaining inner blocks available to be used. Some of the blocks are already used for the first 128 blocks
creation info, so use the remaining ones.

Different techniques for KNN:

We can use one outer block for each level of the tree and use clustering to get the branches for a
particular level. Add all the values of that block and process clustering using this added total value.
Other option is, instead of using one feature per level of tree, we can use couple of features at the top

levels, and as the levels increases, number of features per level decreases. Let say, we have 128 feature
vector. So, use 10 features for first level, 9 for 2ed level, 8 for 3rd level. Now, it gets number of partition
from configuration file or from clustering, and creates that number of partition for that level. Once we
get 128 feature vector, find out which feature has more variety in the training data set. More variety
features could be in the top or bottom and have many partitions and less variety data will get less
partition. In some cases, placing them at the top will be appropriate and in some cases at the bottom.

Another idea is having multiple features for one level. Let say 5 features for first level, 4 for 2ed and 3
for 3rd and so on. That’s how we will not go astray when the data misses a feature. We could have same
data just changes in one feature. We will not get that if that feature is on top. We will be directed to
wrong branches of tree. Combining couple of features at top will reduce that type of errors. This type of
tree could be of two types. One is normal, we will have multiple features on top level and less on down
levels but the number partitions per level stays same. Another could be number of partitions will vary.
At the top, less partitions and more at the bottom. How do we know which features should be together
in a level? It could be sequential, like first 5 features in 1st level, next 4 in 2ed level and so on. Or it could
be less varied features together at the top. Or it could be to find which features have same items in their
partitions. We could partition the data with uniform partitions. After that, see which item is on same
partition on different features. Combine the features in a level which has more similar items in the
partitions. For example, item A is in 1st partition of feature number 1 and 2 and item B is in 1st partition
in 1 and 2 feature. Let say, out of 10 items, 6 items stays same as A and B. so, it is more probable to
combine feature 1 and 2 at top level. Put the largest combination of features at the top and decrease
downward.

Think about tree structure. In real tree, number of branches is not fixed, it gets changed with
environment, data in our case. Every level has different number of branches. This is good if we have the
data to populate the tree beforehand. What about online data, meaning we do not know which data will
come, there tree gets populated as new data comes. That is a different scenario. But most of the AI
application where people do training first, there they have the data beforehand always. Image
processing, speech recognition, natural language processing and some others also do the same way. Or,
we could use more branches at the top and less at the bottom.

Let’s think about clustering for KNN instead of tree search. Let say, we have given a training data with
128 feature vector or 128 outer blocks in each image. Simple way to classify them will be to create 128
clusters, one cluster for each feature. And find out which items are in the same clusters as the test one.
Now, find the item on the common test. Like, 100 items are exactly same cluster as the test one. Then,
find the square distance or Euclidean distance of the test item with each of these 100 item (sum the
distance of each feature) and choose the closest one. Let say, 100 items are in 50 same clusters and 100
items are in 40 clusters, go with 50 same cluster items. Use k means clustering and save the centers for
each clusters. When test item is given, see in each feature cluster, where the test item belongs, find the
closest among the centers of the cluster level. Because, clustering will give proper partition for each
level. Keep few items in a cluster to reduce cluster size.

We can think each block of the image as a feature. For example, we have been given 1024 by 1024 sized
images. We need to divide the image by 64 by 64 blocks. Gather the same block from all training images
and create as many clusters as possible for each one of the blocks. Say, training set has 1000000 face

images. Create 256 block of each image and gather same block from all images and create clusters for
that block’s data. So after training, we will have 256 set of clusters. When testing an image, create 256
blocks of the test image, and see which block belongs to which cluster. Find the items which have most
similar cluster list for each level as test image. Then use the shortest square distance or Euclidean
distance method (sum of all features) to get the match. The one which gives shortest distance will be
the match. If not enough cluster match found, in that case result will be no match found.

Another method could be to find the variation of given data on a particular feature. Find out how close
the data is. Start with small partitions and see how it performs on the partitions. Are all the partitions
have close number of items. If does increase the partition number and repartition it and check again.
Keep doing this until we find weak partitions like partisan with very less items comparing to other
partitions.

Use clustering for each feature and use KD tree. Clustering will tell us how many partitions we have per
level and the partition boundary. We need to use couple of features for top levels. Say, 1st level has 5
features. Each feature has 5 clusters. So, when a data comes, use 5 features to cluster the data for 1st
level. In that way, we will not miss it if one feature does not match. The cluster is produced from
combined effort of 5 features. Do this for all combined feature levels. Use less variation clusters for top.
At First, cluster by each feature and find which features have fewer clusters. Sort them by cluster
numbers. Use less clustered features in the top level. Also try to re‐cluster them with multiple features
together.

We can use genetic algorithm for this. We do not know which features to combine together and also
how many to combine for which level. Create many KD trees with different random combination and
choose the best one. Genetic algorithm or least square method, both will work in this kind of situation.
Try some 1000 combinations and choose the best performer. But, GA will be more powerful. Start with
1000 or 5000 random combinations same as least square. Separate the training set by 2 parts. One for
training and other for testing the intermediate steps. The testing ones should be tagged.

Randomly select how many features will be used in which level and which features for which level. Like,
for 1st level randomly select between 1 to 128 (128 is total features here in this example) for number of
feature for 1st level and say 5 came out from random number generator. Now, choose 5 features
randomly from 128 features. Selected features are out from getting selected again and select number of
features from remaining feature list. Keep an unassigned feature list. Use clustering for each feature and
use KD tree. Clustering will tell us how many partitions are available per level and the partition
boundary. Use couple of features for top levels. Say, 1st level has 5 features. Each feature has 5 clusters.
So, when a data comes, use 5 features to cluster the data for 1st level. In that way, we will not miss it if
one feature does not match. The cluster is from 5 features. Do this for all combined feature levels. Use
less variation clusters at the top. At First, cluster by each feature and find which features have fewer
clusters. Sort them by cluster numbers. Use less clustered features in the top level. Try to re‐cluster
them with multiple features together if needed.

How to create population for BCOM:

Inner blocks are simple and small blocks of an image, for example, 8 by 8 block. And outer blocks are
created using inner blocks. We could use it differently in different kind of situations. We can make it
sequential data in an image, each row and column will come sequentially from top to bottom. Or we
could take each row or column or diagonal line as an inner block and combine those to get outer blocks.
Or we could get random points for inner blocks and random combination of inner blocks for outer
blocks. Outer blocks could be of same size or different size. While collecting data for inner block, we
could only use pixel data or only gradient data or mixture of those. We can also use mean and standard
deviation of pixel values of inner block points. In shape recognition or where we use silhouette, we will
only need outer contour of an object, so in that case we are interested only on few gradient points with
some range of values. In that case, we can only use those gradient points. Instead of giving all the
gradient points into a classifier, we could divide those using blocks and take those to classifiers. In some
case, we are interested more on exact data of the image, for example, face, finger print, palm
recognition and others. In those cases we can use inner block and outer blocks as BCOM or VBCOM.
Randomly choose which points will be in which block. Create many block combinations using this
method. We will test which combination goes well with the training data. And we will do GA, so
eventually we will get a very strong combination which will work well for recognition.
We will use inner block and outer block in BCOM. Inner block is a simple small block, for example, 8 by 8
blocks in an image and outer block consists of inner blocks. If we have 64 points in an outer block, then
each of them represent one inner block. To represent inner block, we can use various information
available inside that block. For example, mean value of pixel value of points inside the block, standard
deviation of those points, number of gradient points, values of those gradient points and some other
information. Using all these information, we can derive single number mathematically which will
represent that block. In some scenarios, we will only need gradient information, in some case, we will
need pixel information if we are trying to match exact color values and in some cases, we will need both.
Whichever is the case, we can use appropriate info to represent inner block. We can even get the inner
block info from adjacent rows and columns of that block or we can select the points randomly while
using BCOM or VBCOM. It is totally depend on the situation. In some cases, direct block info will be
helpful and in some cases, random data for inner block will be helpful. We should keep all these options
open while programming, so that we can choose any option by just changing values in configuration
files.

If the training data has too many variations, then clustering the data is necessary to produce better
result. This means that we have to find where data has separate clusters in it and divide the data
accordingly into many clusters and consider those separately. So, when a test image is given, we will find
out in which cluster it belongs and use that particular classifier or separate weights for that cluster. This
is for huge training set where data will vary a lot. That is why we need to keep separate clusters to see in
which cluster test sample belongs. We can use k means clustering to cluster the large amount of training
data. It simply finds the proper cluster center of the training data and brings the adjacent data close to
the nearest center.

It takes one BCOM’s data or one VBCOM’s data. It takes one block’s data from each image in the
training set and gathers them and finds out how much these data varies. Training set with more than a
million should have some good clusters in the data. It depends on the nature of the training data set. We
cannot tell that beforehand, we will have to find that out dynamically. If we know that our training data
will have large amount of data or if we know that the data will vary a lot, then cluster method is
appropriate for that kind of situation. Also if non cluster version does not produce desired result, then
cluster version should be used. On non cluster version, there will be one classifier for each BCOM, but in
cluster version, there will be many classifiers for each BCOM. At first we have to find out a specific
BCOM of test data belongs to which cluster data of same BCOM’s training data, and then use the proper
classifier for that cluster. If we have many data in training set and we find it difficult to fit one classifier
for each BCOM, then we should use cluster method to separate training data into many clusters and
train those separately. We will have to choose for each BCOM which cluster to use. In this way, even we
have millions of data in the training set, still we will have less difficulties to find patterns. If there is not
much variation in the training data, then we can use non cluster version.

Use mean and standard deviation for inner block. Usually, we will provide one weight for each inner
block. Use that with standard deviation and mean of the block. If we have cluster in outer block, then
one weight for each inner block for each cluster. If we use only gradient points or feature points in
outer block, then one weight for each point. For large training set, we will have cluster in outer block
info. Use mean, st.dev, gradient point info for inner block and weight. In cluster mode, each cluster will
have its own weights.

How to create children for BCOM:
Keep a portion unchanged and change the remaining. Randomly choose what percent to keep
unchanged and then change the remaining. Only the available points, which are not used in unchanged
part, will be considered for changing blocks.

How to create population for VBCOM:
VBCOM also has inner and outer blocks. But, here outer block’s size is different for different outer block.
Randomly choose which points will be in which block. Create many block combinations using this
method. We will test which combination goes well with the training data. And we will do GA, so
eventually we will get a very strong combination which will work well for recognition.

How to create children for VBCOM:

Keep a portion unchanged and change the remaining. Randomly choose what percent to keep
unchanged and then change the remaining. Only the available points will be considered for changing
blocks.
How to create population for SVM:
Create weights for each block. One weight for each member in the block. So, we will need weight set for
one particular outer block. Get the range of values after multiplying each value of outer block with its
weigh and add those to get total. Now use this weight set to all the same blocks from all images in
training set and get the highest value and lowest value and subtract those to get range for that weight
set. We can try many weight sets and select the one which gives better result. Now examine how many
data fit in the range? If it becomes very hard to fit block data with weights into small range, then we
should cluster the block data and create as many clusters as possible. And use separate weights for each
cluster. We can use different mathematical functions to get the single value for inner block data.

How to create children for SVM:
Keep some part of it unchanged and make changes on the remaining. Get new weights for changing
parts. We can also change the block combination of the changing part. If we are using clustering, we can
process clustering again for the newly created parts.
How to create population for K nearest neighbor:
We need to create random structure of KD tree with different combination of levels and features. If we
are using BCOM or VBCOM, we can provide each BCOM or VBCOM (one outer block) as one level of
tree. Or we can combine couple of outer blocks for each level. This is for when we want to use KNN as
classifier. In case of just finding match for array of data, we can use the feature vector of each data and
create block using a portion of feature vector for each level. We can use each feature as a level or
combined features as one level. We can also introduce more levels on top and less on bottom. Features
with less variation at the top and features with more variation at the bottom. We can find out about
variation using clustering. Clustering will tell us how much variation we have for a particular feature or
outer block or combination of features.

How to create children for K nearest neighbor:

Method 1: by only flipping the levels is one of the ways to create children from parent. In this method,
we need to flip the levels randomly. Say, level one becomes level three and level three becomes level
one. As we are working with dynamic data, we might find a better result with this as well. We would not
know about it beforehand whether this will work or not. We just have to try it and find out the result.

Method 2: Flip the levels with feature changes, meaning, create the feature list again for each level. This
is same as the previous method, this time, not only flip the levels but also change the data in the level.
For example, level one has three properties and level three has five properties. Once level three

becomes level one, it will still have five properties, but different ones. We have to change the properties
for this level. We can get it by randomly choosing from the properties which are available and not taken
by other levels.

Method 3: in this method we will keep some levels unchanged while change the rest with remaining
features. For example, we have ten levels. So, keep five levels unchanged and change the structure and
properties of the remaining levels. Say, level six had three properties, now randomly do selection and
say we have randomly decided to keep four properties for level six. Now, get these four properties
randomly from the remaining properties.

Method 4: in this method, we will keep the number of levels and number of features in the levels same
as before, but we will just change the feature combination. In this method, keep the level number and
number of features for each level same, but change the properties of each level. For example, level one
has three properties‐ property number 3, 5 and 7. But, in the new structure we will choose randomly to
select the properties for level one. Say, we got 4,6,10 in random selection, then these three numbers
will be the properties for level one.

Method 5: Randomly choose what percentage of the tree to remain unchanged. Change the remaining,
meaning, for example, randomly we have found that we want to keep 50% of it. So, keep half of the
levels of the tree and change the remaining of the tree. For the levels which will be restructured,
implement it in the same way when we have created the tree for the first time. Randomly choose how
many properties will be held in each level and which ones. Only the properties which are not used by
the upper part will be used in new levels.

How to create population for neural net:
Create NN with many layers with different nodes in each layer. Randomly create different layers with
different node numbers. We will test and find out which ones work best and also do GA so that we can
eventually get to the best one. In NN, input layer will be same as block size and output layer will have
two nodes for yes and no. we can create as many layers in between. Create random weights for each
node at the beginning so that we can start the process. We will create some models and test and find
out which ones work better. We can process the training by Back propagation method and weight based
method. We will create some random weights and see which one works better.

Every node in left layer is connected with every node in right side layer. Every node in right side has a
connection with every node in the left. And every connection has a weight. So, every node in the right
has many input connection from left nodes and it has an output connection to every node in the far
right nodes. So, it calculates all the input connection with its weight and decides an output value for
that. It also has many weights for nodes in the right. So, each node has an input weights and output
weights.

classifier

Recommended

Recommended

More Related Content

Similar to classifier

Similar to classifier (20)

More from Mutawaqqil Billah

More from Mutawaqqil Billah (14)

classifier