K nearest neighbor

What is k-Nearest Neighbor ?
• a non-parametric method used for classification and regression
• Nearest Neighbor classifiers are defined by their characteristic of
classifying unlabeled examples by assigning them the class of similar
labeled examples
• k nearest neighbors is a simple algorithm that stores all available
cases and classifies new cases by a majority vote of its k neighbors

Continued…
• This algorithm segregates unlabeled data points into well defined
groups
• k-NN is a type of instance-based learning, or lazy learning
• It does not use the training data points to do any generalization (all
the training data is needed during the testing phase)

K-NN Algorithm
• Consider an example of investigating a food type of ingredient tomato, checking
whether it is in fruit or protein food or vegetable
• We had created a dataset in which we recorded our impressions of a number of
ingredients we tasted
• To keep things simple, we rated only two features of each ingredient, first is a
measure from 1 to 10 of how crunchy the ingredient is, second is a 1 to 10 score
of how sweet

• The k-NN algorithm treats the features as coordinates in a
multidimensional feature space

• Here we can notice the pattern, similar types of food tend to be
grouped closely together

• Is tomato a fruit or vegetable?
• To answer this question we have to use Nearest Neighbor approach to
determine which class is a better fit

• Measuring similarity with distance:
• Locating the tomato’s nearest neighbors requires a distance function
• Traditionally, the k-NN algorithm uses Euclidean distance
• Suppose we are measuring distance between tomato (sweetness = 6,
crunchiness = 4), and the green bean (sweetness = 3, crunchiness = 7)
Dist.(tomato, green bean) = (6 – 3)2 + (4 - 7)2 = 4.2
• Similarly we can find the distance of tomato with other food types ingredients
• The distance between tomato and both fruits orange and grape is smaller than
remaining food types, hence tomato can be labeled as a fruit

Choosing an appropriate k
• Determining the value of k plays a significant role in determining the
efficacy of the model and generalize to future data
• Choosing a larger k value over fits the data, and smaller value of k
causes under fitting to the training data
• One common practice is to begin with k equal to the square root of
the number of training examples
• Test several k values on a variety of test datasets and choose the one
that delivers the best classification performance

Preparing data for use with k-NN
• Features are typically transformed to a standard range prior to
applying the k-NN algorithm
• Scale used for the values for each variable might be different
• The best practice is to normalize the data and transform all the values
to a common scale
• Two traditional methods for normalization:
min-max normalization
z-score standardization

1) min-max normalization:
2) Z-score standardization:

Case Study: Detecting Prostate Cancer
• The data set consists of 100 observations and 10 variables (out of
which 8 numeric variables and one categorical variable and is ID)
which are as follows:
1. Radius
2. Texture
3. Perimeter
4. Area
5. Smoothness
6. Compactness
7. Symmetry
8. Fractal dimension

• Applying normalization and prediction:

• Model performance for different k values
For k = 10; accuracy = 60% For k = 9; accuracy = 68%

K-NN Algorithm: Pros and Cons
• Pros:
• The algorithm is highly unbiased in nature
• makes no prior assumption of the underlying data.
• Being simple and effective in nature, it is easy to implement and has gained good
popularity.
• Cons:
• Doesn’t create a model since there’s no abstraction process involved.
• The training process is really fast but the prediction time is pretty high with useful
insights missing at times.
• Building this algorithm requires time to be invested in data preparation (especially
treating the missing data and categorical features) to obtain a robust model.

Applications of k-NN
Agriculture:
• For instance such as simulating daily precipitations and weather variables
• Evaluation of forest inventories and estimating forest variables
• Climate forecasting and estimating soil water parameters
Finance:
• Stock market forecasting, Currency exchange rate, Bank bankruptcies, Credit rating,
Loan Management, Bank customer profiling
Medicine:
• Predict whether a patient, hospitalized due to a heart attack, will have a second
attack
• Estimating the amount of glucose in the blood of a diabetic person

K nearest neighbor

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to K nearest neighbor

Similar to K nearest neighbor (20)

Recently uploaded

Recently uploaded (20)

K nearest neighbor