The KNN (K Nearest Neighbors) algorithm analyzes all available data points and classifies this data, then classifies new cases based on these established categories. It is useful for recognizing patterns and for estimating. The KNN Classification algorithm is useful in determining probable outcome and results, and in forecasting and predicting results, given the existence of multiple variables.
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
What is KNN Classification and How Can This Analysis Help an Enterprise?
1. Master the Art of Analytics
A Simplistic Explainer Series For Citizen Data Scientists
J o u r n e y To w a r d s A u g m e n t e d A n a l y t i c s
3. Basic Terminologies
Target variable usually denoted by Y , is the variable being predicted and
is also called dependent variable, output variable, response variable or
outcome variable (Ex : One highlighted in red box in table below)
Predictor, sometimes called an independent variable, is a variable that is
being used to predict the target variable ( Ex : variables highlighted in
green box in table below )
Age Marital Status Gender
Satisfaction
level
58 married Female High
44 single Female Low
33 married Male Medium
47 married Female High
33 single Female Medium
4. Introduction
• An instance (data record or case) is assigned a class, which is most common among its k nearest
neighbors
• Here, k is a positive integer, typically an odd number and ranging between 1 to 10
For instance, for k = 3, the majority class of 3 nearest neighbors
of center point shown as star in image is Class B (two out of
three circles are purple , i.e. class B) whereas for k=6, majority is
Class A (four out of 6 circles are yellow, i.e. class A)
Note : K is automatically identified by an algorithm based on the
number which gives highest classification accuracy
5. Steps :
• Determine K = number of nearest neighbors( in terms of distance) to
check for class assignment
• Calculate the distance between an instance and all the training
instances
• Rank the instances by distance and find out k nearest neighbors in
terms of shortest distance from new instance
• Gather the classes of nearest neighbors to find out the majority
• Use this majority of class as a final predicted value of a class
6. Example : Input
• Based on two attributes : Acid durability and strength, we want to classify a paper tissue into good/bad quality classes :
Acid durability ( in
Seconds)
Strength
(Kg/Square meter)
Paper tissue
Quality
7 7 Good
7 4 Good
3 4 Bad
1 4 Good
Target Variable (Y)Independent variables/predictors
7. Example : Steps :
Calculate Distance and Ranking to find K nearest neighbors
• For instance :
• If a paper tissue’s acid durability = 3 and strength = 7 then take following steps :
• Step 1 : Decide value of k ; Say it is 3 (based on classification accuracy )
• Step 2,3 : Calculate the distance between each input instance and new instance and rank each input
instance by distance to find out the k nearest neighborsAcid durability ( In
Seconds)
Strength
(Kg/Square
meter)
Paper
tissue
Quality
Distance to instance
Rank by distance to find
nearest neighbor
7 7 Good (7 -3)2 + (7-7)2 =16 3
7 4 Good (7 -3)2 + (4-7)2 =25 4
3 4 Bad (3 -3)2 + (4-7)2 =9 1
1 4 Good (1-3)2 + (4-7)2 =13 2
Input dataset Derived results to find out the majority class of k nearest neighbors
8. Step 4,5 : As the majority class =
Good for the three nearest
neighbors ( two out of three
records have class = Good) ,
predicted class of an instance =
Good, i.e. quality of a paper
tissue having acid durability =3
and strength =7 is good
Final output :
Acid durability ( In
Seconds)
Strength
(Kg/Square
meter)
Paper
tissue
Quality
7 7 Good
7 4 Good
3 4 Bad
1 4 Good
3 7 Good
Example : Steps
Select majority class of k nearest neighbors as predicted
class
9. Example : Steps
Find out Accuracy
CLASSIFICATION ACCURACY : (35+ 70) / (35+70+4+4) = 92%
• The prediction accuracy is useful criterion for assessing the model performance
• Model with prediction accuracy >= 70% is useful
CLASSIFICATION ERROR = 100- Accuracy = 8%
There is 8% chance of error in classification
Good Bad
Good 35 4
Bad 4 70
Predicted
Actual
11. Standard output 1 : Model Summary
Good Bad
Good 35 4
Bad 4 70
ACTUAL VERSUS PREDICTED
Predicted
Actual
PROFILE OF CLASSES
• Good quality class has average acid durability = 6 and Strength = 7
• Bad quality class has average acid durability = 3 and Strength = 4
13. Sample output 3 : Classification plot
• Lesser the overlap
between two classes in
the plot, better the
classification done by
model
Thus, output will contain predicted class and probability columns, confusion matrix
and classification plot
14. Limitations :
• Data needs to be scaled [(x-min(x)/max(x)-min(x)] before inputting in
the algorithm, else it can lead to high % of misclassification and in
turn low accuracy
• Not suitable for classifying categorical variables
• Individual variable importance can not be measured (which
variable(s) is most important or has high contribution in the
classification model)
• For instance, Age/income might be impactful variables or say, determinant factors when
classifying the applicants into likely defaulters/non defaulters
15. General applications
Credit/loan
approval analysis
• Given a list of client’s
transactional
attributes, predict
whether a client will
default or not on a
bank loan
Weather
Prediction
• Based on temperature,
humidity, pressure etc.
predict if it will be
rainy/sunny/cold
weather
Rain forecasting
• Based on temperature,
humidity, pressure etc.
predict if it will be
raining or not
Fraud analysis
• Based on various bills
submitted by an
employee for
reimbursement of
food , travel , medical
expense etc., predict
the likelihood of an
employee doing fraud
16. Use case 1
Business benefit:
•Once classes are assigned, bank will
have a loan applicants’ dataset with
each applicant labeled as
“likely/unlikely to default”
•Based on this labels , bank can easily
make a decision on whether to give
loan to an applicant or not and if yes
then how much credit limit and
interest rate each applicant is eligible
for based on the amount of risk
involved
Business problem :
•A bank loans officer wants to predict if
the loan applicant will be a bank
defaulter or non defaulter based on
attributes such as Loan amount ,
Monthly installment, Employment
tenure , Times delinquent, Annual
income, Debt to income ratio etc.
•Here the target variable would be ‘past
default status’ and predicted class
would be containing values ‘yes or no’
representing ‘likely to default/unlikely
to default’ class respectively
17. Use case 1 : Input Dataset
Customer
ID
Loan
amount
Monthly
installment
Annual
income
Debt to
income
ratio
Times
delinquent
Employment
tenure
Past default
status
1039153 21000 701.73 105000 9 5 4 No
1069697 15000 483.38 92000 11 5 2 No
1068120 25600 824.96 110000 10 9 2 No
563175 23000 534.94 80000 9 2 12 No
562842 19750 483.65 57228 11 3 21 Yes
562681 25000 571.78 113000 10 0 9 No
562404 21250 471.2 31008 12 1 12 Yes
700159 14400 448.99 82000 20 6 6 No
696484 10000 241.33 45000 18 8 2 Yes
18. Use case 1 : Output : Predicted Class
Output : Each record will have the predicted class assigned as shown below (Column :
Predicted class) :
Customer
ID
Loan
amount
Monthly
installment
Annual
income
Debt to
income
ratio
Times
delinquent
Employment
tenure
Past
default
status
Predicted
class
1039153 21000 701.73 105000 9 5 4 No No
1069697 15000 483.38 92000 11 5 2 No No
1068120 25600 824.96 110000 10 9 2 No No
563175 23000 534.94 80000 9 2 12 No No
562842 19750 483.65 57228 11 3 21 Yes No
562681 25000 571.78 113000 10 0 9 No No
562404 21250 471.2 31008 12 1 12 Yes Yes
700159 14400 448.99 82000 20 6 6 No No
696484 10000 241.33 45000 18 8 2 Yes Yes
19. Use case 1 : Output : Class profile
As can be seen in the table above, there are distinctive characteristics of
defaulters (Class : Yes ) and non defaulters ( Class : No )
Defaulters have tendency to be delinquent, higher debt to income ratio and lower
employment tenure as compared to non defaulters
Hence , delinquency , employment tenure and debt to income ratio are the
determinant factors when it comes to classifying loan applicants into likely
defaulter/non defaulters
Class(Likely to
default)
Average
loan
amount
Average
monthly
installment
Average
annual
income
Average debt
to income
ratio
Average
times
delinquent
Average
employment
tenure
No 10447.30 304.87 66467.74 9.58 1.69 16.82
Yes 7521.32 227.43 60935.28 16.55 6.91 4.01
20. Use case 2
Business benefit:
•Given the body profile of a patient and
recent treatments and drugs taken by
him/her , probability of a cure can be
predicted and changes in treatment/drug
can be suggested if required
Business problem :
•A doctor/ pharmacist wants to predict
the likelihood of a new patient’s disease
being cured/not cured based on various
attributes of a patient such as blood
pressure , hemoglobin level, sugar level ,
name of a drug given to patient, name of
a treatment given to patient etc.
•Here the target variable would be ‘past
cure status’ and predicted class would
contain values ‘yes or no’ meaning ‘prone
to cure/ not prone to cure’ respectively
21. Use case 3
Business benefit:
•Such classification can prevent a
company from spending unreasonably
on any employee and can in turn save
the company budget by detecting such
fraud beforehand
Business problem :
•An accountant/human resource
manager wants to predict the
likelihood of an employee doing fraud
to a company based on various bills
submitted by him/her so far such as
food bill , travel bill , medical bill
•The target variable in this case would
be ‘past fraud status’ and predicted
class would contain values ‘yes or no’
representing likely fraud and no fraud
respectively
22. Want to Learn
More?
Get in touch with us @
support@Smarten.com
And Do Checkout the Learning section
on
Smarten.com
June 2018