What is Naïve Bayes Classification and How is it Used for Enterprise Analysis?

Master the Art of Analytics
A Simplistic Explainer Series For Citizen Data Scientists
Journey Towards Augmented Analytics

Introduction
• Naive Bayes is a classification algorithm suitable for binary and multiclass classification.
• It is a supervised classification technique that classifies future objects by assigning class labels to instances/records using conditional probability.
• In supervised classification, the training data are already labeled with a class.
• For example, if fraudulent transactions are already flagged in transactional data and we want to classify future transactions as fraudulent or non-fraudulent, such classification is called supervised.

How it works!
• For each known class value:
• Calculate the probability of each attribute, conditional on the class value: $P(A_i \mid C)$
• Use the product rule to obtain a joint conditional probability for the attributes, multiplied by the class prior: $P(C) \prod_{i=1}^{n} P(A_i \mid C)$
• Once this has been done for all class values, output the class with the highest probability.
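The steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used by any particular tool: the function names, the class labels ("fraud", "legit"), the attribute names and all probabilities are hypothetical values chosen for the example.

```python
from math import prod

def class_scores(priors, likelihoods, observed):
    """Return P(C) * product of P(A_i | C) for every class C."""
    scores = {}
    for label, prior in priors.items():
        cond = likelihoods[label]  # P(attribute | class) for this class
        scores[label] = prior * prod(cond[attr] for attr in observed)
    return scores

def predict(priors, likelihoods, observed):
    """Output the class with the highest score."""
    scores = class_scores(priors, likelihoods, observed)
    return max(scores, key=scores.get)

# Hypothetical toy numbers, made up purely for illustration.
priors = {"fraud": 0.1, "legit": 0.9}
likelihoods = {
    "fraud": {"foreign_card": 0.6, "night_time": 0.7},
    "legit": {"foreign_card": 0.05, "night_time": 0.2},
}
observed = ["foreign_card", "night_time"]

scores = class_scores(priors, likelihoods, observed)
predicted = predict(priors, likelihoods, observed)
print(scores, predicted)
```

The scores are not normalized probabilities; they only need to be compared with each other to pick the winning class.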

How it works! - Example
• For example, a fruit may be considered to be an apple if it is red, round, and about 3″ in diameter.
• Let’s say we have data on 1000 pieces of fruit. Each fruit is a Banana, an Orange or some Other fruit, and we know 3 features of each fruit: whether it’s long or not, sweet or not and yellow or not, as displayed in the table below:

Fruit | Long | Sweet | Yellow | Total
Banana | 400 | 350 | 450 | 500
Orange | 0 | 150 | 300 | 300
Other | 100 | 150 | 50 | 200
Total | 500 | 650 | 800 | 1000

• So from the table above we already know:
• 50% of the fruits are bananas
• 30% are oranges
• 20% are other fruits

• Let’s say we’re given the features of a piece of fruit and we need to predict the fruit class.
• If we’re told that the fruit is Long, Sweet and Yellow, we can classify it using the following approach:
• The probability of the class being “Banana” given the attributes “Long, Sweet and Yellow” is proportional to the score calculated below (the denominator P(Long, Sweet, Yellow) is the same for every class, so it can be ignored when comparing classes):
• P(Banana | Long, Sweet, Yellow)
• ∝ P(Long | Banana) * P(Sweet | Banana) * P(Yellow | Banana) * P(Banana)
• = (400/500) * (350/500) * (450/500) * (500/1000) = 0.8 * 0.7 * 0.9 * 0.5
• = 0.252

• Similarly, we can compute the score for the Orange and Other classes and assign the class with the highest score.
• Score for the class “Orange” given the attributes “Long, Sweet and Yellow”:
• P(Orange | Long, Sweet, Yellow)
• ∝ (0/300) * (150/300) * (300/300) * (300/1000) = 0
• Score for the class “Other fruit” given the attributes “Long, Sweet and Yellow”:
• P(Other | Long, Sweet, Yellow)
• ∝ (100/200) * (150/200) * (50/200) * (200/1000)
• = 0.5 * 0.75 * 0.25 * 0.2 = 0.01875
• Thus the fruit class identified is Banana if the attributes are “Long”, “Sweet” and “Yellow”.
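The arithmetic of the fruit example can be reproduced with a short Python sketch. The counts are the ones used in the calculations above; the variable and function names are illustrative assumptions, not part of any tool.

```python
# Per-class counts taken from the worked example:
# how many fruits of each class are long, sweet, yellow, and the class total.
counts = {
    "Banana": {"Long": 400, "Sweet": 350, "Yellow": 450, "total": 500},
    "Orange": {"Long": 0, "Sweet": 150, "Yellow": 300, "total": 300},
    "Other": {"Long": 100, "Sweet": 150, "Yellow": 50, "total": 200},
}
n_fruits = sum(c["total"] for c in counts.values())  # 1000 pieces of fruit

def score(label, attributes):
    """Class prior times the product of per-attribute conditional frequencies."""
    c = counts[label]
    result = c["total"] / n_fruits          # P(class)
    for attr in attributes:
        result *= c[attr] / c["total"]      # P(attribute | class)
    return result

scores = {label: score(label, ["Long", "Sweet", "Yellow"]) for label in counts}
best = max(scores, key=scores.get)

# Dividing by the sum of the scores turns them into probabilities that add up to 1.
posteriors = {label: s / sum(scores.values()) for label, s in scores.items()}
print(scores, best, posteriors)
```

The class with the highest score is the same before and after normalization, which is why the worked example can skip the division.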

Standard Tuning/Input Parameters

• Label: Provision to select target variable/predefined classes
• Features: Provision to select predictors/independent variables
• Lambda/Smoothing component: By default this value should be set to 1. It is for smoothing of categorical variables in the dataset. It is used primarily for scenarios when you expect to see attributes or data points in the test dataset which weren't present in the training dataset
• Model type: By default the Multinomial option should be selected, as it’s a generic model which can be used for binary as well as multiclass classification

Note: By default the first variable is selected as label and the remaining variables as features in Spark
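To make the role of the Lambda/Smoothing component concrete, here is a minimal sketch of additive (Laplace) smoothing applied to the Orange/Long count from the fruit example. The formula shown is the common textbook form; how a given tool applies lambda internally is an assumption here, and the function name is hypothetical.

```python
def smoothed_likelihood(count, class_total, n_values, lam=1.0):
    """Estimate P(attribute value | class) with additive smoothing.

    count       -- times the attribute value was seen together with the class
    class_total -- number of training records in the class
    n_values    -- number of distinct values the attribute can take
    lam         -- smoothing parameter (lambda); 0 disables smoothing
    """
    return (count + lam) / (class_total + lam * n_values)

# 0 of the 300 oranges are long; "long or not" is a two-valued attribute.
unsmoothed = smoothed_likelihood(0, 300, n_values=2, lam=0.0)
smoothed = smoothed_likelihood(0, 300, n_values=2, lam=1.0)
print(unsmoothed, smoothed)
```

With lambda set to 1, the estimate becomes small but non-zero, so a single unseen combination no longer forces the whole class score to zero.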

Sample UI For Input/Tuning Parameters & Output

Sample UI for selecting input parameters:
• Select the variables you would like to use as predictors/features: Petal length (cm), Petal width (cm), Sepal length (cm), Flower class
• Select the variable you would like to use as target variable: Petal length (cm), Petal width (cm), Sepal length (cm), Flower class

Sample UI for tuning parameters:
• Model type: Multinomial
• Lambda: 1
• The two values above should be set as default values.
• # Classes in target variable: Three. This should be automatically detected based on the number of predefined categories present in the target variable.
• Categorical predictors: None. This should be automatically detected. If none of the variables are categorical in the training set, then ‘None’ should be displayed as shown.

Sample UI for output:
Output will contain a predicted class column and a confusion matrix, as shown below:
• Each instance/record is assigned a class by the model, as shown in the predicted class column, and classification accuracy can be shown using a confusion matrix table.
o Prediction accuracy is a useful criterion for assessing model performance.
o A model with prediction accuracy >= 70% is useful.

Predicted class column:
Petal length (cm) | Petal width (cm) | Actual Class | Predicted Class
5.1 | 3.5 | Versicolor | Versicolor
4.9 | 3 | Virginica | Setosa
4.7 | 3.2 | Setosa | Setosa
5 | 3.6 | Virginica | Virginica
5.4 | 3.9 | Versicolor | Virginica

Confusion matrix (axes: Actual, Predicted):
Class | Setosa | Versicolor | Virginica
Setosa | 50 | 0 | 0
Versicolor | 0 | 42 | 8
Virginica | 0 | 7 | 43

Prediction accuracy = 90%
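As a quick check, the 90% prediction accuracy can be derived from the confusion matrix by dividing the correctly classified records (the diagonal) by all records. The sketch below assumes that the blank cells of the matrix are zeros; the variable names are illustrative.

```python
labels = ["Setosa", "Versicolor", "Virginica"]

# Confusion matrix from the sample output; blank cells are assumed to be 0.
confusion = [
    [50, 0, 0],
    [0, 42, 8],
    [0, 7, 43],
]

correct = sum(confusion[i][i] for i in range(len(labels)))  # diagonal cells
total = sum(sum(row) for row in confusion)                  # all records
accuracy = correct / total
print(f"Prediction accuracy = {accuracy:.0%}")
```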

Limitations
o The Naive Bayes classifier assumes that every feature/predictor is independent, which isn’t always the case
o The training dataset should be large enough to represent the entire population, containing every combination of class label and attributes
o If a class label and a certain attribute value never occur together in the training dataset (e.g. class="nice", shape="sphere"), then the frequency-based probability estimate will be zero for that combination in future data
o This problem happens when the training sample drawn from a population is not fully representative of that population
o It performs better with categorical input variables than with numerical variables. For numerical variables, a normal distribution is assumed, which is a strong assumption.
o A normal distribution is an arrangement of a data set in which most values cluster in the middle of the range and the rest taper off symmetrically toward either extreme, so that it looks like a bell curve.
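The normality assumption for numerical variables can be illustrated with a small sketch: the mean and variance of a variable are estimated per class, and the normal density is then used as the conditional likelihood in place of a frequency. This is a common way to handle numerical features, shown here as an assumption rather than as the method of any specific tool; the sample values and names are hypothetical.

```python
import math
from statistics import mean, variance

def gaussian_likelihood(x, mu, var):
    """Normal density at x, used as a stand-in for P(value | class)."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Hypothetical petal lengths (cm) observed for one class in a training set.
petal_lengths = [1.4, 1.3, 1.5, 1.4, 1.7]
mu = mean(petal_lengths)
var = variance(petal_lengths)  # sample variance

near = gaussian_likelihood(1.5, mu, var)  # value close to the class mean
far = gaussian_likelihood(4.5, mu, var)   # value far from the class mean
print(near, far)
```

If the real values of a variable are strongly skewed or multi-modal, these density estimates can be misleading, which is why the assumption is described as strong.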

General applications
• Credit/loan approval analysis: Given a list of a client’s transactional attributes, predict whether the client will default or not on a bank loan
• Medical diagnosis: Given a list of symptoms, predict if a patient has disease X/Y/Z
• Weather forecasting: Based on temperature, humidity, pressure etc., predict if it will be rainy/sunny/windy tomorrow
• Treatment effectiveness analysis: Based on a patient’s body attributes such as blood pressure, sugar, hemoglobin, name of a drug taken, type of a treatment taken etc., check the likelihood of a disease being cured
• Fraud analysis: Based on various bills submitted by an employee for reimbursement of food, travel, medical expenses etc., predict the likelihood of an employee committing fraud

Use case 1
Business problem:
• A bank loans officer wants to predict whether a loan applicant will be a defaulter or non-defaulter based on attributes such as loan amount, monthly installment, employment tenure, times delinquent, annual income, debt to income ratio etc.
• Here the target variable would be ‘past default status’, and the predicted class would contain the values ‘yes’ or ‘no’, representing the ‘likely to default’ and ‘unlikely to default’ classes respectively.
Business benefit:
• Once classes are assigned, the bank will have a loan applicants’ dataset with each applicant labeled as “likely/unlikely to default”.
• Based on these labels, the bank can easily decide whether to give a loan to an applicant and, if so, what credit limit and interest rate the applicant is eligible for based on the amount of risk involved.

Use case 1: Input dataset
Customer ID | Loan amount | Monthly installment | Annual income | Debt to income ratio | Times delinquent | Employment tenure | Past default status
1039153 | 21000 | 701.73 | 105000 | 9 | 5 | 4 | No
1069697 | 15000 | 483.38 | 92000 | 11 | 5 | 2 | No
1068120 | 25600 | 824.96 | 110000 | 10 | 9 | 2 | No
563175 | 23000 | 534.94 | 80000 | 9 | 2 | 12 | No
562842 | 19750 | 483.65 | 57228 | 11 | 3 | 21 | Yes
562681 | 25000 | 571.78 | 113000 | 10 | 0 | 9 | No
562404 | 21250 | 471.2 | 31008 | 12 | 1 | 12 | Yes
700159 | 14400 | 448.99 | 82000 | 20 | 6 | 6 | No
696484 | 10000 | 241.33 | 45000 | 18 | 8 | 2 | Yes
702598 | 11700 | 381.61 | 45192 | 20 | 7 | 3 | Yes
702470 | 10000 | 243.29 | 38000 | 17 | 9 | 7 | Yes
702373 | 4800 | 144.77 | 54000 | 19 | 8 | 2 | Yes
701975 | 12500 | 455.81 | 43560 | 15 | 8 | 4 | Yes

Use case 1: Output
Each record will have the predicted class assigned as shown below (column: Likelihood to default)
Customer ID | Loan amount | Monthly installment | Annual income | Debt to income ratio | Times delinquent | Employment tenure | Past default status | Likelihood to default
1039153 | 21000 | 701.73 | 105000 | 9 | 5 | 4 | No | No
1069697 | 15000 | 483.38 | 92000 | 11 | 5 | 2 | No | No
1068120 | 25600 | 824.96 | 110000 | 10 | 9 | 2 | No | No
563175 | 23000 | 534.94 | 80000 | 9 | 2 | 12 | No | No
562842 | 19750 | 483.65 | 57228 | 11 | 3 | 21 | Yes | No
562681 | 25000 | 571.78 | 113000 | 10 | 0 | 9 | No | No
562404 | 21250 | 471.2 | 31008 | 12 | 1 | 12 | Yes | Yes
700159 | 14400 | 448.99 | 82000 | 20 | 6 | 6 | No | No
696484 | 10000 | 241.33 | 45000 | 18 | 8 | 2 | Yes | Yes
702598 | 11700 | 381.61 | 45192 | 20 | 7 | 3 | Yes | Yes
702470 | 10000 | 243.29 | 38000 | 17 | 9 | 7 | Yes | Yes
702373 | 4800 | 144.77 | 54000 | 19 | 8 | 2 | Yes | No
701975 | 12500 | 455.81 | 43560 | 15 | 8 | 4 | Yes | Yes

Use case 1: Output: Class profiles
• As can be seen in the table below, there are distinctive characteristics of defaulters (Class: Yes) and non-defaulters (Class: No).
• Defaulters tend to be delinquent more often and to have a higher debt to income ratio and a lower employment tenure than non-defaulters.
• Hence delinquency, employment tenure and debt to income ratio are the determining factors when it comes to classifying loan applicants into likely defaulters and non-defaulters.
Class (Likely to default) | Average loan amount | Average monthly installment | Average annual income | Average debt to income ratio | Average times delinquent | Average employment tenure
No | 10447.30 | 304.87 | 66467.74 | 9.58 | 1.69 | 16.82
Yes | 7521.32 | 227.43 | 60935.28 | 16.55 | 6.91 | 4.01
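A class profile of this kind is simply a per-class average of each attribute. The sketch below computes such averages for three attributes, using only the 13 sample records from the Use case 1 output table, grouped by the predicted class. Because it covers only those sample rows, its averages will not match the figures in the profile table, which presumably summarizes a larger dataset (an assumption). The column and variable names are illustrative.

```python
from collections import defaultdict
from statistics import mean

# (debt to income ratio, times delinquent, employment tenure, likelihood to default)
rows = [
    (9, 5, 4, "No"), (11, 5, 2, "No"), (10, 9, 2, "No"), (9, 2, 12, "No"),
    (11, 3, 21, "No"), (10, 0, 9, "No"), (12, 1, 12, "Yes"), (20, 6, 6, "No"),
    (18, 8, 2, "Yes"), (20, 7, 3, "Yes"), (17, 9, 7, "Yes"), (19, 8, 2, "No"),
    (15, 8, 4, "Yes"),
]
columns = ("debt_to_income", "times_delinquent", "employment_tenure")

# Group the attribute values by predicted class.
grouped = defaultdict(list)
for *values, predicted in rows:
    grouped[predicted].append(values)

# Average every attribute within each class.
profiles = {
    label: {name: mean(col) for name, col in zip(columns, zip(*members))}
    for label, members in grouped.items()
}
for label, profile in profiles.items():
    print(label, profile)
```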

Use case 2
Business problem:
• A doctor/pharmacist wants to predict the likelihood of a new patient’s disease being cured or not cured based on various attributes of the patient such as blood pressure, hemoglobin level, sugar level, name of a drug given to the patient, name of a treatment given to the patient etc.
• Here the target variable would be ‘past cure status’, and the predicted class would contain the values ‘yes’ or ‘no’, meaning ‘likely to be cured’ and ‘unlikely to be cured’ respectively.
Business benefit:
• Given the body profile of a patient and the recent treatments and drugs taken by him/her, the probability of a cure can be predicted and changes in treatment/drug can be suggested if required.

Use case 3
Business problem:
• Predict the disease based on a patient’s symptoms such as body temperature, level of blood pressure, weakness, nausea, indigestion etc.
• In this case, the target variable would be ‘past disease detected’, and the predicted class would contain values such as ‘malaria/typhoid/allergic rhinitis’ etc., representing the name of the likely disease.
Business benefit:
• Based on the symptoms diagnosed, a doctor or a pharmacist can predict the most likely disease which a patient is suffering from and suggest the appropriate drug/treatment accordingly.

Use case 4
Business problem:
• An accountant/human resource manager wants to predict the likelihood of an employee committing fraud against the company based on various bills submitted by him/her so far, such as food bills, travel bills and medical bills.
• The target variable in this case would be ‘past fraud status’, and the predicted class would contain the values ‘yes’ or ‘no’, representing likely fraud and no fraud respectively.
Business benefit:
• Such classification can prevent a company from spending unreasonably on any employee and can in turn protect the company budget by detecting such fraud beforehand.

Want to Learn More?
Get in touch with us @ support@Smarten.com
And do check out the Learning section on Smarten.com
June 2018
