k_nearest_neighbor

Using K nearest neighbor efficiently and smartly in A.I. Applications

Mutawaqqil Billah
Independent Researcher,
B.Sc in Computer Science and Mathematics,
Ramapo College of New Jersey, USA
Address : 906/2, East Shewrapara, Mirpur,
Dhaka, Bangladesh
Phone: 8801912479175
Email : mutawaqqil02@yahoo.com

K nearest neighbor is used when we want to match a specific record against a large data set. For
example, consider finding a person in a country's population database, where each person's record has
many properties that together confirm his identity. We need to match an unknown person, described by
many features, against a database in which every record also has many features. K nearest neighbor is
a natural choice in this kind of situation.

We use k nearest neighbor when we have an array of records where each record has many features to
define it. For example, an atom has three properties: number of electrons, number of protons and number
of neutrons. These three values are the characteristic properties of an atom, and we recognize an atom
using them. Similarly, in pattern recognition, we first need to find all the distinct properties of an
element which will help us to find that element. Let us say we have a data set of one million people,
and for every person we have 100 different properties which will help us to find a particular person.
In this case, we will use k nearest neighbor to find a person in a data set of one million or more
people. There are many ways to implement that. I will go through each option so that we can understand
which method is the best choice.

Method one:
For each property, we divide its range into small, equally spaced sections. For example, for property
number one, which holds integer values, all the numbers from 0 up to 10 will be in section one, 10 up
to 20 will be in section two, and so on. That means we divide the range of that property's data into
ten sections. We do this for all properties. Now, a question may arise: how do we know that all the
data will be between 0 and 100? The data set will be dynamic, which means we will not know the range of
the data beforehand. We need to find this range dynamically for each property, so that the sections fit
each property as well as possible. When a data set is supplied, we first find the lowest and highest
values for each property and then divide the range by some predefined number to get equally spaced
sections. The next step is to walk each record of the data set through these ranges and find which
property lies in which section. For example, take one record, go through its properties in order of
property number, take the first property and find out to which section it belongs. Suppose that, after
dividing the first property dynamically, we have sections covering 0 to 100 and each section's width is
10. If the value of the first property of the sample record is 33, it will be in the 4th section.
Similarly, we find the section number for each property and store it. After processing all the data in
this way, each record will have a section number for each property instead of the real value.
Now, when an unknown person's record is given, we first walk it through the sections of each property
and get a list of section numbers, ordered by property. Then we find the record in the data set which
has the most similar list. A simple way to do that is to compare against each record in turn. Each
record in the training set has a section list saying which property goes to which section, that is, a
property number and a section number for each property. This list is stored so that we can use it when
we need to recognize test data. As this method uses ranges instead of exact values, we can get multiple
matches for a test record, but that number will usually be very small. We can use the Euclidean
distance to find the closest record if we have multiple candidates. Say, after matching the section
lists, we have found ten records which have the same section list as the test record. We then compute
the Euclidean distance from the test record to each of these ten and keep the closest one. In this way,
we are not computing Euclidean distances against a million records, but only against a few, which saves
a lot of unnecessary calculation.
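
The steps of method one can be sketched in Python as follows. This is only a minimal illustration, not
a definitive implementation: the function names (fit_sections, section_list, find_match), the default
of ten sections and the clamping of out-of-range values are our own assumptions.

```python
import math

def fit_sections(data, n_sections=10):
    """Find (lowest value, section width) for each property from the training data."""
    bounds = []
    for p in range(len(data[0])):
        column = [row[p] for row in data]
        lo, hi = min(column), max(column)
        width = (hi - lo) / n_sections or 1.0  # assumption: constant columns get width 1
        bounds.append((lo, width))
    return bounds

def section_list(row, bounds, n_sections=10):
    """Replace each real value by its 1-based section number."""
    sections = []
    for value, (lo, width) in zip(row, bounds):
        idx = int((value - lo) // width) + 1
        # Clamp so the maximum value and unseen out-of-range values stay valid.
        sections.append(min(max(idx, 1), n_sections))
    return tuple(sections)

def find_match(test_row, data, bounds, n_sections=10):
    """Linear scan: keep rows with the same section list, then take the Euclidean-closest."""
    target = section_list(test_row, bounds, n_sections)
    candidates = [r for r in data if section_list(r, bounds, n_sections) == target]
    if not candidates:
        return None
    return min(candidates, key=lambda r: math.dist(r, test_row))
```

Note that find_match still compares the section list of every training record, which is exactly the
linear cost this method suffers from.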

The drawback of this method is that it is very slow. When the data set has 100 million records, it
will take a long time to find a match. Computing the section list of the test record takes little time,
but comparing it with the section list of every training record takes time. It is much faster to use a
tree to search for the match. We will discuss that in the next method.

Method two:
This method is the same as method one; the only difference is that we create a tree from the section
data and use it to find a match. Searching a well-defined tree is fast. For example, if we have a
hundred properties per record, we create a tree with a hundred levels. For each level, we use the
dynamic range information to create the branches, as we do not know the range beforehand. Each level
has its own sections based on the given data set. We can use the same number of sections for every
level or a different number of sections.
For example, level one has a range of 0-100 and level two has a range of 0-200. We can decide that each
level will have ten branches. In that case, each section of level one will be 10 wide and each section
of level two will be 20 wide. Or we can decide that every section will be 10 wide, in which case the
first level will have 10 sections and the second level will have 20 sections. We can try both ways and
see which one works better for a particular training set. Letting the number of sections differ per
level is arguably the more natural solution.
Once the tree is created, our first task is to populate it by inserting the data. As every record has
the same number of properties, all the records end up in leaf nodes. This tree helps us to find the
match quickly: we walk from the top of the tree to the bottom, and the leaf we reach holds the match.
If the leaf node holds multiple records for a test record, we can use the Euclidean distance to pick
the appropriate match.
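
A small sketch of such a tree, using nested Python dictionaries with one level per property, is given
below. The names section_of, build_tree and search are hypothetical, and the (lowest, width) pair for
each level is assumed to have been computed from the training data beforehand.

```python
import math

def section_of(value, lo, width, n_sections):
    """1-based section number of a value, clamped to the valid range."""
    idx = int((value - lo) // width) + 1
    return min(max(idx, 1), n_sections)

def build_tree(data, bounds, n_sections=10):
    """One tree level per property; each branch key is a section number."""
    root = {}
    for row in data:
        node = root
        keys = [section_of(v, lo, w, n_sections) for v, (lo, w) in zip(row, bounds)]
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        node.setdefault(keys[-1], []).append(row)  # last level holds the leaf lists
    return root

def search(tree, test_row, bounds, n_sections=10):
    """Walk from the root to a leaf, then break ties by Euclidean distance."""
    node = tree
    for value, (lo, width) in zip(test_row, bounds):
        node = node.get(section_of(value, lo, width, n_sections))
        if node is None:
            return None  # no training record follows this path
    return min(node, key=lambda row: math.dist(row, test_row))
```

With bounds = [(0, 10), (0, 20)], this reproduces the example above: both levels have ten branches,
10 wide on level one and 20 wide on level two.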

Method three:

This method is the same as method two, with a small difference in the implementation of the tree.
Implementing a tree can be a difficult task for programmers, especially because it can take a lot of
memory. We can reduce the memory use with a trick: do not create all the nodes of the whole tree, only
the ones which are actually used by the training data. Many paths are never used by the training data,
and we do not need to create nodes for them. For example, the first property has ten sections and its
range is 0 to 100. In the training data, no record has a value from 70 to 80 for the first property,
that is, for the first level of the tree. In that case, we do not create a node for this section. We
could even get rid of the section altogether, as no data will go there. But it is wise to keep it if we
plan to add data to the tree dynamically at runtime. Keep it null until we have data for it. It will
only take real space in memory once we initialize it, and we initialize it only when we know that we
are going to use it.
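
The idea of keeping unused branches null can be sketched as below. The Node class and the helper names
are our own illustration, and the section numbers of each record are assumed to be already computed.

```python
class Node:
    """Tree node whose branch slots stay None until a training record needs them."""

    def __init__(self, n_sections):
        self.children = [None] * n_sections  # null slots, one per section
        self.rows = []                       # filled only at leaf nodes

def insert(root, sections, row, n_sections=10):
    """Insert a record along its 1-based section numbers, creating nodes on demand."""
    node = root
    for s in sections:
        if node.children[s - 1] is None:
            node.children[s - 1] = Node(n_sections)  # initialize only when used
        node = node.children[s - 1]
    node.rows.append(row)

def count_nodes(node):
    """Number of nodes that were actually created."""
    return 1 + sum(count_nodes(c) for c in node.children if c is not None)
```

For two levels with ten sections each, a full tree would need 1 + 10 + 100 = 111 nodes, while three
records that share two paths need only five. Because the empty slots are kept, new data can still be
added at runtime.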

Method four:
This method proposes putting multiple properties in each level, with an equal number of properties per
level. If we create a tree with one property in each level, we have two problems. The first problem is
that we do not know which properties should be in the top levels. Properties in the top levels are
encountered first while traversing the tree to find a match. Usually, the features do not all carry the
same weight in identifying a record. Suppose a record has ten properties and three of them are very
important. In that case, we want to place them at the top of the tree so that we do not miss them; in
other words, we want to give them more priority. So the properties in the top levels should be the
important ones. If all the features are equally important for identifying the object, then we have no
problem going with the previous methods. But, in reality, we will see that some features are more
important than others. The second problem is that if we miss one property, we end up in the wrong
branch of the tree, which causes us to find the wrong record or no record at all. For example, we are
looking for a person and we have populated a tree with 30 properties per person. If our test person
misses one property at the top level of the tree for some practical reason, the program cannot find
him. Real-time data is always a little off from the training data because of many types of disturbance
in the real world. For example, we usually take a picture of someone at home and keep that in the
training set. But when we get his picture outside of his house, especially out in the sun with dirt and
dust, his face data will be a little off.
So it is a good idea to keep a couple of properties in each level so that the weight of each level
becomes high. Then, if we miss one property, we will not be directed to the wrong path, or at least we
will have a chance to stay on the main path. The next question is: which properties should be in the
top levels? Given a random training data set, we do not know which properties are important and should
be kept at the top. For a particular task, we might know it and use it in the classifier. But in the
general case we will not know that. Say we want to use this classifier for any type of data; we will
encounter situations about which we do not have any prior knowledge, and then we will not know which
properties are important. For example, a robot may use it to train itself on data it is seeing for the
first time, say Mars land data. Then it needs to find the important properties by working with the
data. For this we can use a genetic algorithm (GA). If we do not use a genetic algorithm, we can use
what we call here the least square method, where we try different combinations and choose the best
performer. But GA searches the combinations more systematically.
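
Grouping several properties into one level can be sketched as follows. Here a tree structure is a list
of levels, each level being a list of property indexes; this layout and the names level_keys, insert
and lookup are assumptions made for illustration only.

```python
def level_keys(sections, structure):
    """Group per-property section numbers into one key (a tuple) per tree level."""
    return [tuple(sections[p] for p in level) for level in structure]

def insert(tree, sections, structure, row):
    """Insert a record into a nested-dictionary tree keyed by level tuples."""
    node = tree
    for key in level_keys(sections, structure):
        node = node.setdefault(key, {})
    node.setdefault("rows", []).append(row)

def lookup(tree, sections, structure):
    """Exact lookup: every level key must match."""
    node = tree
    for key in level_keys(sections, structure):
        node = node.get(key)
        if node is None:
            return []
    return node.get("rows", [])
```

For six properties and structure = [[0, 2, 4], [1, 3, 5]], the tree has only two levels, and each
branch decision depends on three properties at once.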

Method five:
This method proposes more properties in the top levels. It is not a bad idea to have more properties in
the top levels and fewer in the lower ones. This way, we are less likely to be directed to the wrong
path. Say we have chosen five properties for level one. If a test record matches four properties and
misses one, we still have a chance to take the right path. In real data, it is hard to match
everything. Consider, for example, one person's picture in the training set and a real-time picture of
him. Even though they show the same person, there will still be some differences between them.
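
One possible way to tolerate a missed property inside a level is sketched below. The rule used here
(follow the branch that agrees on the most properties, and give up when more than max_misses differ) is
our own assumption; the text only requires that matching four out of five properties can still lead to
the right path.

```python
def agreement(key_a, key_b):
    """Number of positions where two level keys hold the same section number."""
    return sum(a == b for a, b in zip(key_a, key_b))

def best_branch(node, test_key, max_misses=1):
    """Return the branch key sharing the most properties with test_key.

    node maps level keys (tuples of section numbers) to child nodes.
    Returns None when even the best branch differs in more than max_misses properties.
    """
    if not node:
        return None
    best = max(node, key=lambda k: agreement(k, test_key))
    if len(test_key) - agreement(best, test_key) > max_misses:
        return None
    return best
```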

Method six:
This method proposes using BCOM with GA or least square. BCOM means block combination and VBCOM means
variable length block combination. In BCOM, all the levels are the same size, that is, every level
holds the same number of properties or features. In VBCOM, each level of the tree can get a differently
sized feature list. We run GA on the BCOM sets, or use the least square method, to get a better result.
Each BCOM set proposes a structure for the tree. In the least square method, we try different BCOM sets
and choose the one which performs best. In GA, we create a tree and populate it according to the
structure defined by each BCOM, test it and see how it performs. Then we keep the best-performing 10%
(or any predefined percentage) of the BCOMs and remove the rest. We produce children from these 10% to
fill the remaining 90% of the new population and run the process again with the new population. We keep
running it for a predefined number of generations and take the best performer.
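
A BCOM structure and the simple try-and-keep-the-best search can be sketched as below. The function
names are hypothetical, and score stands for any caller-supplied function that builds a tree from a
structure, tests it and returns a number (higher is better); it is not defined in the text.

```python
import random

def random_bcom(n_props, block_size, rng):
    """Shuffle the property indexes and cut them into equal-sized levels."""
    if n_props % block_size:
        raise ValueError("block_size must divide n_props for equal-sized levels")
    props = list(range(n_props))
    rng.shuffle(props)
    return [props[i:i + block_size] for i in range(0, n_props, block_size)]

def best_of_trials(n_props, block_size, score, trials=50, seed=0):
    """Try several random BCOM structures and keep the best performer."""
    rng = random.Random(seed)
    candidates = [random_bcom(n_props, block_size, rng) for _ in range(trials)]
    return max(candidates, key=score)
```

For example, random_bcom(12, 3, random.Random(1)) gives four levels with three properties each.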

Method seven:
In this method, we discuss VBCOM with GA or least square. VBCOM means variable length block
combination. Each level of the tree can get a different size when using VBCOM. We run GA on it or use
the least square method. In the least square method, we try different VBCOM sets and choose the one
which performs best. In VBCOM, if level one's size is three, that does not fix the size of level two.
It could be any number. This method is more dynamic and natural, and we expect it to work better than
BCOM. In BCOM, we need to decide the size of each level, but when we get dynamic data we will not know
beforehand which size is best for that particular data set. It is better to keep things natural when we
are dealing with random data. VBCOM gives us the option to try different sizes in each level. For
random dynamic data, we will also not know which property can be mixed with which property in which
level in the most workable fashion. All this information is dynamic and real-time. This way, we can use
it for training on any type of data which needs k nearest neighbor. We do not need to guide it; it will
find the best solution by itself. Which property mixes with which one, how many properties should be in
each level, which property mixture works best for the given training data: all these questions can be
answered by using VBCOM.

If we decide to use GA, we test many VBCOMs, keep the best-performing 10% and produce children from
these to fill the new population, then run the process again. We keep running it for a predefined
number of generations and take the best performer.
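
The GA loop over VBCOM structures can be sketched as follows. This is a sketch under stated
assumptions: score is a caller-supplied function (higher is better) that is not defined in the text,
and the default child operator simply swaps two levels of a parent.

```python
import random

def random_vbcom(n_props, rng):
    """Shuffle the properties and cut them into levels of random sizes."""
    props = list(range(n_props))
    rng.shuffle(props)
    levels, i = [], 0
    while i < n_props:
        size = rng.randint(1, n_props - i)
        levels.append(props[i:i + size])
        i += size
    return levels

def swap_two_levels(parent, rng):
    """Default child operator (an assumption): exchange two random levels."""
    child = [list(level) for level in parent]
    if len(child) >= 2:
        i, j = rng.sample(range(len(child)), 2)
        child[i], child[j] = child[j], child[i]
    return child

def evolve(n_props, score, make_child=swap_two_levels,
           pop_size=20, generations=10, keep=0.1, seed=0):
    """Keep the best `keep` share each generation and refill with their children."""
    rng = random.Random(seed)
    population = [random_vbcom(n_props, rng) for _ in range(pop_size)]
    n_keep = max(1, int(pop_size * keep))
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        parents = population[:n_keep]
        children = [make_child(rng.choice(parents), rng)
                    for _ in range(pop_size - n_keep)]
        population = parents + children
    return max(population, key=score)
```

Because the parents are carried over unchanged, the best score never gets worse from one generation to
the next.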

Method eight:
This method is the same as method six, except that we use k-means clustering to get the branches in
each level. First, we add up the values of all the properties in each level, or combine them by some
other mathematical calculation, to get one total value per level. Then, instead of taking the highest
and lowest values and dividing the range by some number to get the branches of the level, we use
k-means clustering on these totals. We get the center and radius of each cluster. When a test record is
provided, for each level we find out to which cluster it belongs and then follow that branch. This way
of branching a level's data is very dynamic and effective. We do not need to fix a section length for a
level; the cluster boundaries follow the data, which is a more natural solution, and the number of
clusters can be chosen separately for each level. With this method, each level can have a different
number of branches according to the variation of the data available for that level.
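
Branching by clusters can be sketched with a plain one-dimensional k-means on the level totals. The
code below is a minimal pure-Python illustration; the starting centers (evenly spaced sorted values)
and the fixed number of iterations are our own choices, not part of the method.

```python
def kmeans_1d(values, k, iterations=20):
    """Plain 1-D k-means; centers start at evenly spaced sorted values."""
    ordered = sorted(values)
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            groups[nearest].append(v)
        # An empty cluster keeps its old center.
        centers = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
    return centers

def cluster_radii(values, centers):
    """Radius of each cluster: largest distance from its center to a member."""
    radii = [0.0] * len(centers)
    for v in values:
        c = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
        radii[c] = max(radii[c], abs(v - centers[c]))
    return radii

def branch_for(row, level, centers):
    """Sum the level's properties and return the index of the nearest cluster center."""
    total = sum(row[p] for p in level)
    return min(range(len(centers)), key=lambda c: abs(total - centers[c]))
```

The number of clusters k is still an input here; it can be chosen differently for each level.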

Method nine:
This method is the same as method seven, except that we use k-means clustering to get the branches in
each level. First, we add up the values of all the properties in each level, or combine them by some
other mathematical calculation, to get one total value per level. Then, instead of taking the highest
and lowest values and dividing the range by some number to get the branches of the level, we use
k-means clustering. We get the center and radius of each cluster. When a test record is provided, we
find out to which cluster it belongs and then follow that branch.

Method ten:
This method is the same as method 8, except that we do not use a tree here. First, we add up the values
of all the properties in each level, or combine them by some other mathematical calculation, to get one
total value per level. Then, instead of taking the highest and lowest values and dividing the range by
some number, we use k-means clustering. We get the center and radius of each cluster. If the data has a
hundred levels, then for each level we have some clusters of data. For a test record, we first break it
down into the level structure and then find out which cluster is the most appropriate for each level.
Once we know that, we look for the training record which has the same cluster result, or the closest
result. That will be the match for the test record.
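
Matching without a tree can be sketched as below. The cluster centers of each level are assumed to have
been found by k-means beforehand, and "closest result" is interpreted here, as an assumption, as the
smallest number of levels whose cluster differs.

```python
def signature(row, structure, centers_per_level):
    """Cluster index of the record in every level (level total = sum of its properties)."""
    sig = []
    for level, centers in zip(structure, centers_per_level):
        total = sum(row[p] for p in level)
        sig.append(min(range(len(centers)), key=lambda c: abs(total - centers[c])))
    return tuple(sig)

def closest_by_signature(test_row, data, structure, centers_per_level):
    """Training record whose cluster result differs from the test record in the fewest levels."""
    target = signature(test_row, structure, centers_per_level)

    def mismatches(row):
        sig = signature(row, structure, centers_per_level)
        return sum(a != b for a, b in zip(sig, target))

    return min(data, key=mismatches)
```

In practice the signatures of the training records would be computed once and stored, rather than
recomputed for every test record as in this short sketch.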

Method eleven:
This method is the same as method 9, except that we do not use a tree here. First, we add up the values
of all the properties in each level, or combine them by some other mathematical calculation, to get one
total value per level. Then, instead of taking the highest and lowest values and dividing the range by
some number, we use k-means clustering. We get the center and radius of each cluster. When a test
record is provided, we find out to which cluster it belongs in each level and look for the training
record with the same or the closest cluster result.

By clustering, we can learn which levels have more variation in their data and which have less. We can
keep the low-variation levels at the top and the high-variation levels at the bottom. We want the test
data to have fewer ways to get divided at the top and more at the bottom. In this way, we have less
chance of going astray if the test data differs a little from the training data. As we pass each level,
fewer records remain for matching. At the top level we have the whole training set, but after each
level fewer records remain under consideration for a match. We can become more specific at the bottom,
as we are close to the leaf nodes.

While searching for data in the tree, if we do not find any match at a certain level, we can go back to
the parent node, try other paths and mark a mismatch for that level. In that way, we still look for a
match even if one level's data mismatches, which will occur many times in real situations. Say we have
a tree of ten levels. After passing five levels, we find a mismatch, meaning no similar data on the
sixth level. In that case, we try the child nodes of the fifth-level node to continue without a match
for the sixth level, and see how many matches we can get by going through these child nodes down to a
leaf node. Whenever we have a mismatch, we mark it, use it to compute the percentage of match and try
other paths to reach a leaf node. If the sixth-level nodes do not lead on to the next level, we try the
child nodes of the fourth-level node. Basically, we try to reach a leaf node with the maximum number of
matches, taking an alternative path whenever a match is not available. When we reach a leaf node, we
examine on how many levels we found a match. For example, if we found a match on five out of ten
levels, then it is a fifty percent match.
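
The search with marked mismatches can be sketched as a small recursive function. The tree is assumed to
be a nested dictionary with one level per key and lists of records at the leaves. This sketch is
simplified: it follows an exact match whenever one exists and tries the sibling branches only when the
level mismatches.

```python
def search(node, keys, level=0, matched=0):
    """Return (number of matched levels, records at the leaf) for the best path found."""
    if level == len(keys):
        return matched, node  # node is the list of records stored at a leaf
    if keys[level] in node:
        return search(node[keys[level]], keys, level + 1, matched + 1)
    best = (-1, [])
    for child in node.values():  # mismatch on this level: try every branch below the parent
        result = search(child, keys, level + 1, matched)
        if result[0] > best[0]:
            best = result
    return best

def match_percentage(matched, n_levels):
    """Share of levels on which the test record matched, in percent."""
    return 100.0 * matched / n_levels
```

For example, match_percentage(5, 10) gives 50.0, the fifty percent match described above.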

How to create children:

Method 1: Flipping the levels is one way to create a child from a parent. In this method, we swap
levels at random. Say, level one becomes level three and level three becomes level one. As we are
working with dynamic data, we might find a better result this way. We cannot know beforehand whether
this will work or not. We just have to try it and look at the result.

Method 2: Flip the levels and change the features, meaning we create the feature list again for each
flipped level. This is the same as the previous method, but this time we not only flip the levels but
also change their contents. For example, level one has three properties and level three has five
properties. Once level three becomes level one, it still has five properties, but different ones. We
get the new properties by choosing at random from the properties which are available and not taken by
other levels.

Method 3: In this method, we keep some levels unchanged and rebuild the rest with the remaining
features. For example, we have ten levels. We keep five levels unchanged and change the structure and
properties of the remaining levels. Say level six had three properties earlier; now we choose its size
at random and happen to decide on four properties for level six. We then pick these four properties at
random from the remaining properties.

Method 4: In this method, we keep the number of levels and the number of features in each level the
same as before, and change only the feature combination. For example, level one has three properties:
property numbers 3, 5 and 7. In the new structure, we select the properties for level one at random.
Say we get 4, 6 and 10 in the random selection; then these three will be the properties of level one.

Method 5: Randomly choose what percentage of the tree remains unchanged, and change the rest. For
example, say the random choice tells us to keep 50% of it. Then we keep half of the levels of the tree
and change the remainder. The levels which are restructured are built in the same way as when we
created the tree for the first time: we randomly choose how many properties each level holds and which
ones. Only the properties which are not used by the upper part can be used in the new levels.
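
Three of these child operators can be sketched as follows, with a structure represented as a list of
levels and each level as a list of property indexes. The function names are hypothetical, and in
rebuild_tail the kept fraction is passed in by the caller, who may draw it at random as in Method 5.

```python
import random

def swap_levels(parent, rng):
    """Method 1: exchange two randomly chosen levels."""
    child = [list(level) for level in parent]
    if len(child) >= 2:
        i, j = rng.sample(range(len(child)), 2)
        child[i], child[j] = child[j], child[i]
    return child

def reshuffle_features(parent, rng):
    """Method 4: keep the level count and sizes, redraw which properties sit in each level."""
    props = [p for level in parent for p in level]
    rng.shuffle(props)
    child, i = [], 0
    for level in parent:
        child.append(props[i:i + len(level)])
        i += len(level)
    return child

def rebuild_tail(parent, rng, keep_fraction=0.5):
    """Methods 3 and 5: keep the top levels, rebuild the rest from the unused properties."""
    n_keep = int(len(parent) * keep_fraction)
    child = [list(level) for level in parent[:n_keep]]
    remaining = [p for level in parent[n_keep:] for p in level]
    rng.shuffle(remaining)
    while remaining:
        size = rng.randint(1, len(remaining))  # random size for the next new level
        child.append(remaining[:size])
        remaining = remaining[size:]
    return child
```

Every operator returns a new structure that still uses each property exactly once.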

How to test:
We can carry out the testing in many ways. One way is to change a portion of the test sample, keep the
rest the same as before and see which tree structures can still recognize it. For example, change 10%
of the data of the test sample at random, meaning we randomly select which part of the data gets
changed, change at most 10% of the original data and examine which structures are still able to
recognize the record. We select the portions to change at random because we will not know beforehand
which portions will differ in real data. To reduce the number of successful structures, we can test
again with a higher percentage changed, for example 20%. We can keep doing this until we get a good
result. This is how we can perform automatic testing. It will be a great thing for robots if they know
how to test sample data. In our sleep, our brain probably does the same thing: it simulates events to
get better weights for its classifiers and to reorganize the links which it created in a rush. I mean,
while awake and working, the brain does not always select the best possible way to perform its task. It
just finds a workable way in a hurry, but when we go to sleep, it fixes those links and finds better
weights for the classifiers. That may be why we need sleep so frequently. It is like two friends who
went shopping: one friend was buying things and giving the bags to the other. The other person was just
adding bags to his hands in a rush, and finally he got tired. He asked his friend to wait for a moment
and reorganized the bags so that he could carry them in a more efficient way.
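
The automatic test described above can be sketched as below. Several details are assumptions made for
illustration: a fraction of the properties is picked at random, each picked value is changed by up to
10% of itself (the noise parameter), and recognize is a caller-supplied function that returns the
training record matched by the structure under test.

```python
import random

def perturb(row, fraction, rng, noise=0.1):
    """Change a randomly chosen fraction of the properties by up to +/- noise (relative)."""
    row = list(row)
    n_change = max(1, int(len(row) * fraction))
    for p in rng.sample(range(len(row)), n_change):
        row[p] *= 1 + rng.uniform(-noise, noise)
    return row

def recognition_rate(recognize, data, fraction, seed=0):
    """Share of training records that are still recognized after perturbation."""
    rng = random.Random(seed)
    hits = sum(recognize(perturb(row, fraction, rng)) == row for row in data)
    return hits / len(data)
```

Structures whose recognition rate stays high at fraction 0.1 can then be tested again at 0.2, and so
on, to narrow down the candidates.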

It is better to keep the top levels bigger so that we do not slip on a slight difference between the
training data and the test data. If we keep the top layers large, we consider many properties at the
beginning, not specific ones. If we go for a specific property, we are asking a direct question, and we
have a good chance of taking the wrong path if the data is slightly different. It is better to become
more specific at the bottom. We can also find out about the variation in the data using clustering and
use that information in creating the tree structure. Before creating the tree, we can first find out
how much variation each feature has by clustering that feature's data in the training set. Then we can
keep the low-variation features at the top and the high-variation features at the bottom. In some
scenarios, we might instead want to keep the high-variation ones at the top, according to the demands
of the assigned task.
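
Ordering the features by variation can be sketched as below. Note that, to keep the example short, the
variance of the min-max scaled values is used here as a simple stand-in for a clustering-based measure
of variation; this substitution is our own assumption.

```python
from statistics import pvariance

def order_by_variation(data):
    """Return the property indexes sorted from least to most varied."""

    def spread(p):
        column = [row[p] for row in data]
        lo, hi = min(column), max(column)
        if hi == lo:
            return 0.0  # a constant feature has no variation
        # Scale to [0, 1] first so that features with large units do not dominate.
        return pvariance([(v - lo) / (hi - lo) for v in column])

    return sorted(range(len(data[0])), key=spread)
```

The first indexes of the returned list would go to the top levels of the tree and the last ones to the
bottom; reversing the list gives the opposite arrangement.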

Using these techniques, k nearest neighbor becomes an intelligent one. It can find the tree structure
which works best for many types of training data in various fields of artificial intelligence, and it
is meant to work for large data sets of more than a million records.
