CS8080_IRT_UNIT - III T6 K-NN CLASSIFIER.pdf

P1WU
UNIT – III: CLASSIFICATION
Topic 6: K-NN CLASSIFIER
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SEMESTER – VIII
PROFESSIONAL ELECTIVE – IV
CS8080- INFORMATION RETRIEVAL TECHNIQUES

UNIT III : TEXT CLASSIFICATION AND CLUSTERING
1.A Characterization of Text
Classification
2. Unsupervised Algorithms:
Clustering
3. Naïve Text Classification 4.
Supervised Algorithms
5. Decision Tree
6. k-NN Classifier
7. SVM Classifier
8. Feature Selection or
Dimensionality Reduction
9. Evaluation metrics
10. Accuracy and Error
11. Organizing the classes
12. Indexing and Searching
13. Inverted Indexes
14. Sequential Searching
15. Multi-dimensional
Indexing
SEMESTER – VIII

K-NN CLASSIFIER
SEMESTER – VIII

K-NN CLASSIFIER
• K-Nearest Neighbour is one of the simplest Machine Learning
algorithms based on Supervised Learning technique.
• K-NN algorithm assumes the similarity between the new case/data
and available cases and put the new case into the category that is
most similar to the available categories.
• K-NN algorithm stores all the available data and classifies a new data
point based on the similarity.
• This means when new data appears then it can be easily classified into a well
suite category by using K- NN algorithm.
SEMESTER – VIII

K-NN CLASSIFIER
• supervised ML classification algorithm-KNN(K Nearest Neighbors)
algorithm.
• It is one of the simplest and widely used classification algorithms in
which a new data point is classified based on similarity in the
specific group of neighboring data points.
• This gives a competitive result.
SEMESTER – VIII

K-NN CLASSIFIER EXAMPLE
• Example: Suppose, we have an image of a creature that looks similar
to cat and dog,
• but we want to know either it is a cat or dog. So for this identification, we can
use the KNN algorithm, as it works on a similarity measure.
• Our KNN model will find the similar features of the new data set to
the cats and dogs images and based on the most similar features it
will put it in either cat or dog category.
SEMESTER – VIII

K-NN CLASSIFIER EXAMPLE
SEMESTER – VIII

K Nearest Neighbor Classification
SEMESTER – VIII

INTRODUCTION TO K-NN CLASSIFIER
• K nearest neighbors is a simple algorithm that stores
• all available cases and classifies new cases based on a similarity measure (e.g.,
distance functions).
• K represents number of nearest neighbors.
• It classify an unknown example with the most common class
among k closest examples.
• KNN is based on
• “tell me who your neighbors are, and I’ll tell you who you are”
SEMESTER – VIII

INTRODUCTION TO K-NN CLASSIFIER :- Example
If K = 5, then in this case query instance xq will be classified
as negative since three of its nearest neighbors are classified
as negative.
SEMESTER – VIII

Different Schemes of KNN
• 1-Nearest Neighbor
• K-Nearest Neighbor using a majority voting scheme
• K-NN using a weighted-sum voting Scheme
SEMESTER – VIII

Different Schemes of KNN
SEMESTER – VIII

kNN: How to Choose k?
• In theory, if infinite number of samples available, the larger is k, the
better is classification
• The limitation is that all k neighbors have to be close
• Possible when infinite no of samples available
• Impossible in practice since no of samples is finite k = 1 is often used for efficiency, but
sensitive to “noise”
SEMESTER – VIII

SEMESTER – VIII

• Larger k gives smoother boundaries, better for generalization But only
if locality is preserved. Locality is not preserved if end up looking at
samples too far away, not from the same class.
• Interesting theoretical properties if k < sqrt(n), n is # of examples .
SEMESTER – VIII
Find a heuristically optimal number k of nearest
neighbors, based on RMSE(root-mean-square error).
This is done using cross validation.
Cross-validation is another way to retrospectively determine a good K value by using an independent
dataset to validate the K value. Historically, the optimal K for most datasets has been between 3-10.
That produces much better results than 1NN.

Distance Measure in KNN
• There are three distance measures are valid for continuous variables.
SEMESTER – VIII

Distance Measure in KNN
• It should also be noted that all In the instance of categorical variables the Hamming distance must be used.
• It also brings up the issue of standardization of the numerical variables between 0 and 1 when there is a
mixture of numerical and categorical variables in the dataset.
SEMESTER – VIII

Simple KNN - Algorithm:
• For each training example , add the example to the list of training_examples.
• Given a query instance xq to be classified,
• Let x1 ,x2….xk denote the k instances from training_examples that are nearest to xq .
• Return the class that represents the maximum of the k instances
• Steps:
1. Determine parameter k= no of nearest neighbor
2. Calculate the distance between the query instance and all the training samples.
3. Sort the distance and determine nearest neighbor based on the k –th minimum distance
4. Gather the category of the nearest neighbors
5. Use simple majority of the category of nearest neighbors as the prediction value of the query
instance.
SEMESTER – VIII

Simple KNN - Algorithm:
• K-NN algorithm can be used for Regression as well as for Classification but mostly
it is used for the Classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.
• It is also called a lazy learner algorithm because it does not learn from the
training set immediately instead it stores the dataset and at the time of
classification, it performs an action on the dataset.
• KNN algorithm at the training phase just stores the dataset and when it gets new
data, then it classifies that data into a category that is much similar to the new
data.
SEMESTER – VIII

Simple KNN – Algorithm Example
• Example:
• Consider the following data concerning credit default. Age and Loan are two numerical variables
(predictors) and Default is the target.
SEMESTER – VIII

• Given Training Data set :
SEMESTER – VIII

• Data to Classify:
• to classify an unknown case (Age=48 and Loan=$142,000) using Euclidean distance.
•
• Step1: Determine parameter k
• K=3
•
• Step 2: Calculate the distance
• D = Sqrt[(48-33)^2 + (142000-150000)^2] = 8000.01 >> Default=Y
SEMESTER – VIII

SEMESTER – VIII

• Step 3: Sort the distance ( refer above diagram) and mark upto kth rank i.e 1 to 3.
•
• Step 4: Gather the category of the nearest neighbors
SEMESTER – VIII
Age Loan Default Distance
33 $150000 Y 8000
35 $120000 N 22000
60 $100000 Y 42000
With K=3, there are two Default=Y and one Default=N out of three closest neighbors.
The prediction for the unknown case is Default=Y.

Standardized Distance ( Feature Normalization)
• One major drawback in calculating distance measures directly from the training set is in
the case where variables have different measurement scales or there is a mixture of
numerical and categorical variables.
• For example, if one variable is based on annual income in dollars, and the other is based
on age in years then income will have a much higher influence on the distance calculated.
One solution is to standardize the training set as shown below.
SEMESTER – VIII

Standardized Distance ( Feature Normalization)
• For ex loan , X =$ 40000 ,
• Xs = 40000- 20000 = 0.11
• 220000-20000
•
Same way , calculate the standardized values for age and loan attributes, then
apply the KNN algorithm.
SEMESTER – VIII

Simple KNN – Algorithm
• Advantages
• Can be applied to the data from any distribution
• for example, data does not have to be separable with a linear boundary
• Very simple and intuitive
• Good classification if the number of samples is large enough
•
• Disadvantages
• Choosing k may be tricky
• Test stage is computationally expensive
• No training stage, all the work is done during the test stage
• This is actually the opposite of what we want. Usually we can afford training step to take a long time, but we
want fast test step
• Need large number of samples for accuracy
SEMESTER – VIII

How does K-NN work?
• he K-NN working can be explained on the basis of the below algorithm:
• Step-1: Select the number K of the neighbors
• Step-2: Calculate the Euclidean distance of K number of neighbors
• Step-3: Take the K nearest neighbors as per the calculated Euclidean
distance.
• Step-4: Among these k neighbors, count the number of the data points in
each category.
• Step-5: Assign the new data points to that category for which the number
of the neighbor is maximum.
• Step-6: Our model is ready.
SEMESTER – VIII

How does K-NN work?
• Suppose we have a new data point and we need to put it in
the required category. Consider the below image:
SEMESTER – VIII

How does K-NN work?
• Firstly, we will choose the number of neighbors, so we will choose the k=5.
• Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the
distance between two points, which we have already studied in geometry. It can be calculated as:
SEMESTER – VIII

How does K-NN work?
• By calculating the Euclidean distance we got the nearest neighbors, as
three nearest neighbors in category A and two nearest neighbors in
category B. Consider the below image:
SEMESTER – VIII

Why do we need a K-NN Algorithm?
•Suppose there are two categories, i.e., Category A
and Category B, and we have a new data point x1, so
this data point will lie in which of these categories.
•To solve this type of problem, we need a K-NN
algorithm. With the help of K-NN, we can easily
identify the category or class of a particular dataset. :
SEMESTER – VIII

Why do we need a K-NN Algorithm?
•Consider the below diagram:
SEMESTER – VIII

Any Questions?
SEMESTER – VIII

CS8080_IRT_UNIT - III T6 K-NN CLASSIFIER.pdf

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to CS8080_IRT_UNIT - III T6 K-NN CLASSIFIER.pdf

Similar to CS8080_IRT_UNIT - III T6 K-NN CLASSIFIER.pdf (20)

More from AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING

More from AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING (13)

Recently uploaded

Recently uploaded (20)

CS8080_IRT_UNIT - III T6 K-NN CLASSIFIER.pdf