DATA STRUCTURES PROJECT
• Project name:
C++ implementation of the KNN algorithm.
• Group members:
Muhammad Umair 2015-CS-05
Afraz Khan 2015-CS-27
Hassan Tariq 2015-CS-67
BACKGROUND
“Classification is a data mining technique used to predict group
membership for data instances.”
• The predicted group membership is then used to classify future
data instances.
ORIGINS OF K-NN
• Nearest-neighbor methods have been used in statistical estimation and
pattern recognition since the beginning of the 1970s (as non-
parametric techniques).
• The method has prevailed across several disciplines and is still one
of the top 10 data mining algorithms.
DEFINITION
• K-Nearest Neighbor is considered a lazy learning algorithm
that classifies data items based on their similarity to their
neighbors.
• “K” stands for the number of dataset items (neighbors)
that are considered for the classification.
Ex: The image shows classification results for different k-values.
TECHNICALLY…..
• For the given attributes A = {X_1, X_2, …, X_D}, where D is the
dimension of the data, we need to predict the corresponding
classification group G = {Y_1, Y_2, …, Y_n} using a proximity
metric over K items in the D-dimensional space that defines the
closeness of association, such that X ∈ R^D and Y_p ∈ G.
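The slides do not fix a particular proximity metric; Euclidean distance is the usual choice for K-NN, so the following C++ sketch assumes it:

#include <cmath>
#include <vector>

// Euclidean distance between two points in R^D; used here as the
// proximity metric (an assumption, since the slides leave it open).
double euclideanDistance(const std::vector<double>& x,
                         const std::vector<double>& y) {
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        double d = x[i] - y[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}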
THAT IS….
• Attributes A = {Color, Outline, Dot}
• Classification Group, G = {triangle, square}
• D = 3; we are free to choose the value of K.
(Figure: sample table listing the attributes A of each object and its
classification group.)
K-NN IN ACTION
• Consider the following data:
A = {weight, color}
G = {Apple(A), Banana(B)}
• We need to predict the type of a
fruit with weight = 378 and color = red.
SOME PROCESSING….
• Assign color codes to convert the color attribute into numerical data.
• Let’s label Apple as “A” and Banana as “B”. A sketch of the encoding
follows.
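This is a minimal sketch of that step; the particular numeric codes are assumptions, since the slides do not show them:

#include <map>
#include <string>
#include <vector>

// Hypothetical color codes: the slides only say colors are converted
// to numbers, so these exact values are assumed for illustration.
std::map<std::string, double> colorCode = {
    {"red", 0.0}, {"yellow", 1.0}, {"green", 2.0}};

// The query fruit (weight = 378, color = red) then becomes:
std::vector<double> query = {378.0, colorCode["red"]};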
K-NN VARIATIONS
• (K-l)-NN: reduce complexity by putting a threshold on the
majority; we can restrict the associations through (K-l)-NN.
Ex: Classify only if the majority count is over a given
threshold l; otherwise reject.
Here, K = 5 and l = 4. As there is no
majority with count > 4, we reject
classifying the element. A sketch of this test follows.
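A minimal sketch of that majority test, assuming the neighbor votes have already been counted into a map from class label to count:

#include <algorithm>
#include <map>
#include <string>

// Return the majority class among the K neighbors only if its vote
// count exceeds the threshold l; otherwise reject (empty string).
std::string classifyWithThreshold(const std::map<std::string, int>& votes,
                                  int l) {
    auto best = std::max_element(
        votes.begin(), votes.end(),
        [](const auto& a, const auto& b) { return a.second < b.second; });
    if (best != votes.end() && best->second > l) return best->first;
    return "";  // rejected: no class has count > l
}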
K-NN PROPERTIES
• K-NN is a lazy algorithm.
• Processing is deferred until classification time and depends on the
chosen K value.
• The result is generated only after analysis of the stored data.
• No intermediate values are retained between queries.
REMARKS: FIRST THE GOOD
Advantages
• Can be applied to data from any distribution;
for example, the data does not have to be separable by a linear
boundary
• Very simple and intuitive
• Good classification if the number of samples is large enough
NOW THE BAD….
Disadvantages
• Dependent on the K value
• The test stage is computationally expensive
• There is no training stage; all the work is done during the test stage
• This is actually the opposite of what we want: usually we can
afford a long training step, but we want a fast test step
• Needs a large number of samples for accuracy
C++ IMPLEMENTATION
Working process:
• Read the dataset from a file.
• Input a new instance.
• Input the threshold (K).
• Classify the dataset relative to the new instance.
• Output the most common class among the K nearest neighbors of the
entered instance. A minimal end-to-end sketch follows this list.
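The project's actual source is not reproduced in the slides, so this is only a sketch of the process above; the dataset values are hypothetical and a hard-coded vector stands in for the file-input step:

#include <algorithm>
#include <cmath>
#include <iostream>
#include <string>
#include <vector>

struct Instance {
    std::vector<double> features;  // e.g. {weight, color code}
    std::string label;             // e.g. "A" (apple) or "B" (banana)
};

double distance(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(sum);
}

// Classify a query by majority vote among its k nearest instances.
std::string classify(std::vector<Instance> data,
                     const std::vector<double>& query, int k) {
    std::sort(data.begin(), data.end(),
              [&](const Instance& p, const Instance& q) {
                  return distance(p.features, query) <
                         distance(q.features, query);
              });
    if (k > static_cast<int>(data.size())) k = static_cast<int>(data.size());
    std::string best;
    int bestCount = 0;
    for (int i = 0; i < k; ++i) {      // for each candidate label...
        int count = 0;
        for (int j = 0; j < k; ++j)    // ...count its votes among the k
            if (data[j].label == data[i].label) ++count;
        if (count > bestCount) { bestCount = count; best = data[i].label; }
    }
    return best;
}

int main() {
    // Hypothetical dataset standing in for the file-input step
    // (features are {weight, color code}, labels "A"/"B").
    std::vector<Instance> data = {
        {{303, 0}, "A"}, {{370, 0}, "A"}, {{298, 0}, "A"},
        {{150, 1}, "B"}, {{160, 1}, "B"}, {{170, 1}, "B"}};
    std::cout << classify(data, {378, 0}, 3) << '\n';  // prints "A"
}

Compiled with a C++14 compiler (e.g. g++ -std=c++14), this follows the listed steps, with the file input replaced by the hard-coded vector.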
We first add our dataset to WEKA and also to our C++ program:
(Screenshots: WEKA file upload and C++ file upload.)
Then the user enters an instance (tuple) along with a K value to classify it:
We can set the value of K in WEKA just as in the C++ program:
right-click the applied algorithm and change the K value as desired.
We apply the KNN algorithm to the dataset in WEKA, while in C++ we devised a
sorting routine that returns the K instances closest to the entered tuple
(sketched below).
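The slides do not show that routine; one idiomatic way to order only the K closest instances in C++ is std::partial_sort, so this sketch is an assumption:

#include <algorithm>
#include <utility>
#include <vector>

// Order only the first k entries of (distance, index) pairs so that
// they hold the k smallest distances; the rest stay unsorted. This is
// O(n log k), cheaper than fully sorting the whole dataset.
// Assumes 1 <= k <= dist.size().
void kClosest(std::vector<std::pair<double, int>>& dist, int k) {
    std::partial_sort(dist.begin(), dist.begin() + k, dist.end());
}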
The C++ and WEKA outputs are approximately the same. Small differences
may occur because in C++ we enter the tuple ourselves.
