This document provides an introduction to support vector machine (SVM) classification. It defines key SVM terminology like target variable, predictor variables, hyperplane, support vectors, and margin. It then provides an example to demonstrate how SVM classification works by predicting a loan applicant's likelihood of default based on variables like age, marital status, income, etc. It shows the input data, output with predicted classes, and a confusion matrix to evaluate accuracy. Finally, it discusses common business use cases for SVM classification like credit approval analysis, medical diagnosis, and fraud detection.
What is SVM Classification Analysis and How Can It Benefit Business Analytics?
1. Master the Art of Analytics
A Simplistic Explainer Series For Citizen Data Scientists
Journey Towards Augmented Analytics
3. Basic Terminologies
Target variable, usually denoted by Y, is the variable being predicted. It is also called the dependent variable, output variable, response variable or outcome variable (e.g., Satisfaction level in the table below)
Predictor, sometimes called an independent variable, is a variable that is used to predict the target variable (e.g., Age, Marital Status and Gender in the table below)

Age   Marital Status   Gender   Satisfaction level
58    married          Female   High
44    single           Female   Low
33    married          Male     Medium
47    married          Female   High
33    single           Female   Medium
4. Basic Terminologies
Hyperplane:
A line (in 2D) or a plane (in 3D) that linearly separates and classifies a set of data, as shown in the image on the right
Support vectors :
The data points nearest to the hyperplane boundary; they "support" the separation of the dataset into predefined classes
Margin:
The distance between the hyperplane and the nearest data point from either class
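The three terms above can be tied together with a small calculation. The sketch below is illustrative only: the line x1 + x2 - 5 = 0 and the four points are made-up values, not data from this deck. It measures each point's perpendicular distance to the hyperplane, takes the smallest distance as the margin, and treats the points at that distance as the support vectors.

```python
import math

def distance_to_hyperplane(w, b, x):
    """Perpendicular distance from point x to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm

# Hypothetical separating line in 2D: x1 + x2 - 5 = 0
w, b = (1.0, 1.0), -5.0

# Hypothetical data points, two on each side of the line
points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]

distances = [distance_to_hyperplane(w, b, p) for p in points]

# Margin: distance from the hyperplane to the nearest point
margin = min(distances)

# Support vectors: the points that sit exactly at the margin
support_vectors = [p for p, d in zip(points, distances) if math.isclose(d, margin)]

print(round(margin, 4), support_vectors)
```

Here the points (2, 2) and (3, 3) lie closest to the line, so they play the role of support vectors, and moving any of the farther points would not change the margin.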
6. Introduction
SVMs are based on the idea of finding a hyperplane that best divides a dataset into predefined classes, as shown in the image below.
The goal is to choose the hyperplane with the greatest possible margin between the hyperplane and any point in the training set, which gives new data a greater chance of being classified correctly
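To make the "greatest possible margin" criterion concrete, the sketch below compares three hand-picked candidate hyperplanes on a tiny made-up dataset and keeps the one with the largest margin. All points, labels and candidates are hypothetical. An actual SVM solver does not enumerate candidates like this; it searches over all hyperplanes by solving an optimization problem, so treat this only as an illustration of the selection rule.

```python
import math

def signed_margin(w, b, samples):
    """Smallest signed distance y * (w.x + b) / ||w|| over all samples.

    A negative result means at least one sample lies on the wrong side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        label * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, label in samples
    )

# Hypothetical training set: (point, class label in {-1, +1})
samples = [
    ((1.0, 1.0), -1), ((2.0, 1.0), -1),
    ((4.0, 4.0), +1), ((5.0, 3.0), +1),
]

# Hypothetical candidate hyperplanes, each given as (w, b)
candidates = {
    "A": ((1.0, 1.0), -5.5),   # x1 + x2 = 5.5
    "B": ((1.0, 0.0), -3.0),   # x1 = 3
    "C": ((0.0, 1.0), -2.0),   # x2 = 2
}

margins = {name: signed_margin(w, b, samples) for name, (w, b) in candidates.items()}

# Keep the candidate whose nearest training point is farthest away
best_name = max(margins, key=margins.get)
print(best_name, round(margins[best_name], 3))
```

All three candidates separate the two classes, but candidate A leaves the most room on both sides, which is why it would be preferred.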
7. Example : Input
Let’s conduct the SVM classification on the following variables :

Default Status   Age   Marital Status   Existing Loan Status   Income
Defaulted        58    married          no                     46,399
Not Defaulted    44    single           no                     47,971
Defaulted        33    married          yes                    52,618
Defaulted        47    married          no                     28,717
Not Defaulted    33    single           no                     41,216
Defaulted        35    married          no                     34,372
Not Defaulted    28    single           yes                    64,811

Target Variable (Y) : Default Status
Independent variables (Xi) : Age, Marital Status, Existing Loan Status, Income
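An SVM works on numeric vectors, so a table like the one above usually has to be encoded before training. The sketch below shows one possible encoding of the seven rows; the 0/1 coding of the categorical columns, the +1/-1 class labels and the min-max scaling are assumptions chosen for illustration, not steps stated in this deck. Scaling matters because an SVM is distance based, so an unscaled column such as Income would otherwise dominate Age.

```python
# Rows copied from the input table:
# (default status, age, marital status, existing loan status, income)
rows = [
    ("Defaulted",     58, "married", "no",  46399),
    ("Not Defaulted", 44, "single",  "no",  47971),
    ("Defaulted",     33, "married", "yes", 52618),
    ("Defaulted",     47, "married", "no",  28717),
    ("Not Defaulted", 33, "single",  "no",  41216),
    ("Defaulted",     35, "married", "no",  34372),
    ("Not Defaulted", 28, "single",  "yes", 64811),
]

def min_max(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = min_max([r[1] for r in rows])
incomes = min_max([r[4] for r in rows])

# Independent variables (Xi): one numeric vector per applicant
X = [
    [age, 1.0 if r[2] == "married" else 0.0, 1.0 if r[3] == "yes" else 0.0, income]
    for r, age, income in zip(rows, ages, incomes)
]

# Target variable (Y): +1 for Defaulted, -1 for Not Defaulted (assumed coding)
y = [1 if r[0] == "Defaulted" else -1 for r in rows]

for features, label in zip(X, y):
    print([round(f, 3) for f in features], label)
```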
8. Example : Output 1
Age   Marital Status   Existing Loan Status   Income   Default Status   Predicted class
58    married          no                     46,399   Defaulted        Defaulted
44    single           no                     47,971   Not Defaulted    Not Defaulted
33    married          yes                    52,618   Defaulted        Defaulted
47    married          no                     28,717   Defaulted        Defaulted
33    single           no                     41,216   Not Defaulted    Not Defaulted
35    married          no                     34,372   Defaulted        Not Defaulted
28    single           yes                    64,811   Not Defaulted    Defaulted

Thus each existing or new instance will be assigned a predicted class
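Once every instance has both an actual and a predicted class, the pairs can be tallied into a confusion matrix. The sketch below does this for the seven rows shown above, using only the Python standard library; the variable names are illustrative.

```python
from collections import Counter

# "Default Status" and "Predicted class" columns from the seven rows above
actual = [
    "Defaulted", "Not Defaulted", "Defaulted", "Defaulted",
    "Not Defaulted", "Defaulted", "Not Defaulted",
]
predicted = [
    "Defaulted", "Not Defaulted", "Defaulted", "Defaulted",
    "Not Defaulted", "Not Defaulted", "Defaulted",
]

labels = ["Defaulted", "Not Defaulted"]

# Count each (actual, predicted) combination
pair_counts = Counter(zip(actual, predicted))

# Rows are actual classes, columns are predicted classes
matrix = [[pair_counts[(a, p)] for p in labels] for a in labels]

correct = sum(a == p for a, p in zip(actual, predicted))
print(matrix, f"{correct} of {len(actual)} correct")
```

For these seven rows, five predictions match the actual class and two do not (the last two rows).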
9. Example : Output 2
Classification Accuracy : (35 + 70) / (35 + 70 + 4 + 4) = 105 / 113 ≈ 93%
• Prediction accuracy is a useful criterion for assessing model performance
• As a rule of thumb, a model with prediction accuracy >= 70% is considered useful
Classification Error = 100% - Accuracy ≈ 7%
About 7% of instances are misclassified

Actual versus predicted (rows : actual, columns : predicted)
                Defaulted   Not defaulted
Defaulted       35          4
Not defaulted   4           70
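The accuracy and error figures follow directly from the four counts in the confusion matrix: 105 correct out of 113 is about 92.9%, leaving about 7.1% misclassified. The sketch below reproduces that arithmetic and, as an extra not shown on the slide, also computes recall for the Defaulted class (the share of actual defaulters the model caught). It assumes rows are actual classes and columns are predicted classes.

```python
# Confusion matrix counts from the slide
# (assumed layout: rows = actual, columns = predicted)
defaulted_as_defaulted = 35
defaulted_as_not = 4
not_as_defaulted = 4
not_as_not = 70

total = defaulted_as_defaulted + defaulted_as_not + not_as_defaulted + not_as_not

# Accuracy: correctly classified instances / all instances
accuracy = (defaulted_as_defaulted + not_as_not) / total

# Classification error is the complement of accuracy
error = 1 - accuracy

# Recall for the Defaulted class: caught defaulters / all actual defaulters
recall_defaulted = defaulted_as_defaulted / (defaulted_as_defaulted + defaulted_as_not)

print(f"accuracy = {accuracy:.1%}, error = {error:.1%}, recall = {recall_defaulted:.1%}")
```

Accuracy alone can hide uneven performance when one class is much larger than the other, which is why a per-class figure such as recall is often reported alongside it.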
12. Sample output 1 : Predicted class
Age   Marital Status   Existing Loan Status   Income   Default Status   Predicted class
58    married          no                     46,399   Defaulted        Defaulted
44    single           no                     47,971   Not Defaulted    Not Defaulted
33    married          yes                    52,618   Defaulted        Defaulted
47    married          no                     28,717   Defaulted        Defaulted
33    single           no                     41,216   Not Defaulted    Not Defaulted
35    married          no                     34,372   Defaulted        Not Defaulted
28    single           yes                    64,811   Not Defaulted    Defaulted
42    divorced         no                     53,000   Not Defaulted    Not Defaulted
58    married          no                     41,375   Defaulted        Defaulted
43    single           no                     53,778   Not Defaulted    Defaulted
13. Sample output 2 : Model Summary
ACTUAL VERSUS PREDICTED (rows : actual, columns : predicted)
              Default   Non default
Default       35        4
Non default   4         70

PROFILE OF CLASSES
Class           Average loan amount   Average annual income   Average age
Non defaulter   10447.30              66467.74                40
Defaulter       7521.32               60935.28                26
14. Sample output 3 : Classification plot
• The less the two classes overlap in the plot, the better the classification done by the model
Thus, the output will contain a predicted class column, a confusion matrix, a class profile and a classification plot
15. Limitations
• Processing time of SVM algorithm on large datasets can be high
• Less effective on datasets with overlapping classes
17. General applications
CREDIT/LOAN APPROVAL ANALYSIS
• Given a list of a client’s transactional attributes, predict whether the client will default on a bank loan
MEDICAL DIAGNOSIS
• Given a list of symptoms, predict whether a patient has disease X
RAIN FORECASTING
• Based on temperature, humidity, pressure etc., predict whether it will rain
TREATMENT EFFECTIVENESS ANALYSIS
• Based on a patient’s body attributes such as blood pressure, sugar, hemoglobin, the name of the drug taken, the type of treatment taken etc., estimate the likelihood of a disease being cured
FRAUD ANALYSIS
• Based on the bills an employee submits for reimbursement of food, travel, medical expenses etc., predict the likelihood of the employee committing fraud
18. Use case 1
• Business problem :
• A bank loans officer wants to predict whether a loan applicant will be a defaulter or a non defaulter based on attributes such as Loan amount, Monthly installment, Employment tenure, Times delinquent, Annual income, Debt to income ratio etc.
• Here the target variable would be ‘past default status’ and the predicted class would contain the values ‘yes’ or ‘no’, representing the ‘likely to default’ and ‘unlikely to default’ classes respectively
• Business benefit:
• Once classes are assigned, the bank will have a loan applicants’ dataset with each applicant labeled as “likely/unlikely to default”
• Based on these labels, the bank can decide whether to give a loan to an applicant and, if so, what credit limit and interest rate the applicant is eligible for given the amount of risk involved
19. Use case 1 : Input Dataset
Customer ID   Loan amount   Monthly installment   Annual income   Debt to income ratio   Times delinquent   Employment tenure   Past default status
1039153       21000         701.73                105000          9                      5                  4                   No
1069697       15000         483.38                92000           11                     5                  2                   No
1068120       25600         824.96                110000          10                     9                  2                   No
563175        23000         534.94                80000           9                      2                  12                  No
562842        19750         483.65                57228           11                     3                  21                  Yes
562681        25000         571.78                113000          10                     0                  9                   No
562404        21250         471.2                 31008           12                     1                  12                  Yes
700159        14400         448.99                82000           20                     6                  6                   No
696484        10000         241.33                45000           18                     8                  2                   Yes
20. Use case 1 : Output : Predicted Class
Output : Each record will have a predicted class assigned, as shown below (Column : Predicted class) :

Customer ID   Loan amount   Monthly installment   Annual income   Debt to income ratio   Times delinquent   Employment tenure   Past default status   Predicted class
1039153       21000         701.73                105000          9                      5                  4                   No                    No
1069697       15000         483.38                92000           11                     5                  2                   No                    No
1068120       25600         824.96                110000          10                     9                  2                   No                    No
563175        23000         534.94                80000           9                      2                  12                  No                    No
562842        19750         483.65                57228           11                     3                  21                  Yes                   No
562681        25000         571.78                113000          10                     0                  9                   No                    No
562404        21250         471.2                 31008           12                     1                  12                  Yes                   Yes
700159        14400         448.99                82000           20                     6                  6                   No                    No
696484        10000         241.33                45000           18                     8                  2                   Yes                   Yes
21. Use case 1 : Output : Class profile
As can be seen in the table below, defaulters (Class : Yes) and non defaulters (Class : No) have distinctive characteristics
Defaulters tend to be delinquent more often and to have a higher debt to income ratio and a shorter employment tenure than non defaulters
Hence, delinquency, employment tenure and debt to income ratio are the determining factors when classifying loan applicants into likely defaulters and non defaulters

Class (Likely to default)   Average loan amount   Average monthly installment   Average annual income   Average debt to income ratio   Average times delinquent   Average employment tenure
No                          10447.3               304.87                        66467.74                9.58                           1.69                       16.82
Yes                         7521.32               227.43                        60935.28                16.55                          6.91                       4.01
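A class profile like this is simply a set of per-class averages. The sketch below shows one way to compute it by grouping records on the predicted class, using the nine sample records from the Use case 1 output and three of the columns. Because it uses only those nine rows, its averages will not match the figures in the profile table, which presumably summarise a larger dataset; the field names are illustrative.

```python
from collections import defaultdict
from statistics import mean

# Nine sample records from the Use case 1 output:
# (debt to income ratio, times delinquent, employment tenure, predicted class)
records = [
    (9, 5, 4, "No"), (11, 5, 2, "No"), (10, 9, 2, "No"),
    (9, 2, 12, "No"), (11, 3, 21, "No"), (10, 0, 9, "No"),
    (12, 1, 12, "Yes"), (20, 6, 6, "No"), (18, 8, 2, "Yes"),
]

fields = ("debt_to_income", "times_delinquent", "employment_tenure")

# Group the records by predicted class
groups = defaultdict(list)
for *values, predicted_class in records:
    groups[predicted_class].append(values)

# Average each column within each class
profile = {
    cls: {name: mean(row[i] for row in rows) for i, name in enumerate(fields)}
    for cls, rows in groups.items()
}

for cls, averages in profile.items():
    print(cls, {name: round(value, 2) for name, value in averages.items()})
```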
22. Use case 2
Business problem :
• A doctor/pharmacist wants to predict the likelihood of a new patient’s disease being cured based on patient attributes such as blood pressure, hemoglobin level, sugar level, the name of the drug given to the patient, the name of the treatment given to the patient etc.
• Here the target variable would be ‘past cure status’ and the predicted class would contain the values ‘yes’ or ‘no’, meaning ‘likely to be cured’ and ‘unlikely to be cured’ respectively
Business benefit:
• Given the body profile of a patient and the recent treatments and drugs taken by him/her, the probability of a cure can be predicted and changes in treatment/drug can be suggested if required
23. Use case 3
Business problem :
• An accountant/human resource manager wants to predict the likelihood of an employee committing fraud against the company based on the bills he/she has submitted so far, such as food bills, travel bills and medical bills
• The target variable in this case would be ‘past fraud status’ and the predicted class would contain the values ‘yes’ or ‘no’, representing likely fraud and no fraud respectively
Business benefit:
• Such classification can prevent a company from spending unreasonably on any employee and can in turn protect the company budget by detecting such fraud beforehand
24. Want to Learn More?
Get in touch with us at support@Smarten.com
And do check out the Learning section on Smarten.com
June 2018