2. Information about dataset
• UCI machine learning repository link :
https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
• This dataset has 20 input attributes.
• Attributes 2–15 have categorical values.
• The 21st attribute, named ‘y’, is our class attribute, which we want to predict.
3. Using logistic regression for classification
• Assign numerical values to the categorical input data and normalize the numeric attributes.
• To convert categorical data into numeric form we use treatment (dummy) coding, a variant of one-hot encoding.
• In this type of encoding, if a categorical attribute has n distinct values, the system creates a table of n x (n-1) numerical values associated with that attribute.
• Each row of that table contains at most one 1; the remaining entries are 0.
4. Example of categorical to numeric
• The 2nd attribute, job, has 12 levels (i.e. it has 12 distinct values).
• After conversion the factor carries an additional ‘contrasts’ table that describes the coding.
• Here you can see that each value is coded as an 11-bit indicator vector (12 levels give 11 dummy columns).
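The coding described above can be inspected directly in R. This is a minimal sketch using a made-up three-level factor (not the bank data) to show the n x (n-1) contrast table:

```r
# Treatment contrasts for a hypothetical 3-level factor:
# 3 levels give a 3 x 2 table with at most one 1 per row.
f <- factor(c("admin", "blue-collar", "student"))
contrasts(f) <- contr.treatment(length(levels(f)))
print(contrasts(f))
# the first level is coded as all zeros; every other level gets a single 1
```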
5. R code that converts all categorical inputs to numeric values
column_list = c(2,3,4,5,6,7,8,9,10,15)  # indices of the categorical columns
for (i in column_list)
{
  n = length(levels(bank_data[[i]]))              # number of distinct levels
  contrasts(bank_data[[i]]) = contr.treatment(n)  # n-1 dummy columns
}
6. Normalizing attributes
The following R code normalizes the attributes that have numerical values (other than those whose values are already 0 or 1):
# min-max normalization onto the [0, 1] range
normal = function(x)
{
  return((x - min(x)) / (max(x) - min(x)))
}
column_list = c(11,12,13,14,16,17,18,19,20)  # indices of the numeric columns
for (i in column_list)
{
  bank_data[[i]] = normal(bank_data[[i]])
}
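As a quick sanity check (on a toy vector, not the dataset), normal() maps any numeric vector onto [0, 1]:

```r
# min-max normalization, as defined on this slide
normal = function(x)
{
  return((x - min(x)) / (max(x) - min(x)))
}
normal(c(2, 4, 6))  # -> 0.0 0.5 1.0
```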
7. Preparing test and train data
• We take approximately 9% of the data as the test set and the rest as training data.
• While splitting the data into test and train sets we should preserve the proportion of “yes” and “no” class values.
• In the whole dataset, 11% of the rows have “yes” in the 21st column and 89% have “no”.
• We have to maintain the same proportion in the test data as well.
8. R Code for making test/train set
# split the rows by class
bank_data_yes = bank_data[bank_data$y == "yes", ]
bank_data_no  = bank_data[bank_data$y == "no", ]
# build a logical index with exactly 3000 TRUEs, then shuffle it
true = vector('logical', length = 3000)
true = !true                                  # 3000 TRUEs
false = vector('logical', length = length(bank_data_no[[1]]) - 3000)  # rest FALSE
total_index_no = c(false, true)
x_no = runif(length(bank_data_no[[1]]))       # random keys for shuffling
total_index_no = total_index_no[order(x_no)]  # shuffle the index
test_no = bank_data_no[total_index_no, ]
This gives us the negative (“no”) part of the test set in test_no.
9. R Code for making test/train set
true_yes = vector('logical', length = 400)
true_yes = !true_yes                              # 400 TRUEs
false_yes = vector('logical', length = length(bank_data_yes[[1]]) - 400)
total_index_yes = c(false_yes, true_yes)
x_yes = runif(length(bank_data_yes[[1]]))         # random keys for shuffling
total_index_yes = total_index_yes[order(x_yes)]   # shuffle the index
test_yes = bank_data_yes[total_index_yes, ]
total_test = as.data.frame(rbind(test_yes, test_no))
This gives us the positive (“yes”) part of the test set in test_yes; we then combine the two parts into one dataset with rbind() and name it total_test.
10. R Code for making test/train set
train_yes = bank_data_yes[!total_index_yes, ]
train_no  = bank_data_no[!total_index_no, ]
total_train = as.data.frame(rbind(train_yes, train_no))
• These commands build the training dataset by excluding the test rows from the main dataset.
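The three slides above implement a stratified split by hand. The same result can be sketched more compactly with sample() (assuming bank_data_yes and bank_data_no from slide 8, and the same test sizes of 400 and 3000):

```r
set.seed(1)  # for reproducibility
idx_yes <- sample(nrow(bank_data_yes), 400)   # 400 random "yes" rows
idx_no  <- sample(nrow(bank_data_no), 3000)   # 3000 random "no" rows
total_test  <- rbind(bank_data_yes[idx_yes, ],  bank_data_no[idx_no, ])
total_train <- rbind(bank_data_yes[-idx_yes, ], bank_data_no[-idx_no, ])
```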
11. Using glm for logistic regression
model <- glm(y ~ ., family = binomial(link = 'logit'), data = total_train[, -11])
• The response is written as y ~ . so that the y column of the data is used only as the response, not as a predictor.
• We have not included the 11th column (duration) in the training data because it is clearly mentioned on the UCI repository page that,
“this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet,
the duration is not known before a call is performed. Also, after the end of the call
y is obviously known. Thus, this input should only be included for benchmark
purposes and should be discarded if the intention is to have a realistic predictive
model.”
12. Summary of model
• The summary(model) command gives the output shown below; *** marks the most significant attributes.
13. Predict the test data with the logistic regression model
The code below gives us the predicted output for the test set; notice that here too we have excluded the 11th column.
fitted.results <- predict(model, total_test[, -11], type = 'response')
fitted.results_yes_no <- ifelse(fitted.results > 0.5, "yes", "no")
table(total_test$y, fitted.results_yes_no)
Here we use the threshold value 0.5, which gives good overall accuracy but cannot avoid a large error in the ‘true positive’ predictions.
14. Accuracy
• Confusion matrix with 0.5 as the threshold.
• Here we get an overall accuracy of 89.5%, but if you look at the true (“yes”) rows alone, they are predicted with an accuracy of only 20.5%.
• To avoid such loss we will analyze the ROC curve.
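Both figures can be computed directly from the confusion matrix. conf_metrics below is a hypothetical helper (not in the original deck) that works on the output of table(actual, predicted):

```r
# derive overall accuracy and true positive rate ("yes" recall)
# from a confusion matrix built with table(actual, predicted)
conf_metrics <- function(cm) {
  acc <- sum(diag(cm)) / sum(cm)              # correct / total
  tpr <- cm["yes", "yes"] / sum(cm["yes", ])  # "yes" rows predicted "yes"
  c(accuracy = acc, tpr = tpr)
}
# usage: conf_metrics(table(total_test$y, fitted.results_yes_no))
```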
15. R code to plot the ROC curve
You’ll need the “ROCR” package:
require(ROCR)
pr <- prediction(fitted.results, total_test$y)  # predicted scores vs. actual labels
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
16. ROC curve
• This is the ROC curve, and here we can clearly see that the maximum ‘true positive’ rate we can reach is only about 62%.
The area under this curve is given by the following code:
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
The value of the AUC is 0.7618.
17. Increasing the true positive rate
• To increase the true positive rate we have to change the threshold.
• It was observed that decreasing the threshold below 0.5 increases the number of true positives.
• But from the ROC curve we can say that the optimum true positive rate we could achieve is between 0.60 and 0.62.
• For this process we slowly decrease the threshold and observe the true positive rate at each step.
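The search described above can be sketched as a small helper plus a loop (tpr_at is a hypothetical name; fitted.results and total_test come from the earlier slides):

```r
# true positive rate at a given probability cutoff
tpr_at <- function(probs, actual, threshold) {
  pred <- ifelse(probs > threshold, "yes", "no")
  mean(pred[actual == "yes"] == "yes")
}
# sweep the threshold downward and print the TPR at each step:
# for (t in seq(0.5, 0.05, by = -0.05))
#   cat(sprintf("threshold %.2f -> TPR %.3f\n", t,
#               tpr_at(fitted.results, total_test$y, t)))
```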
18. The optimal threshold is 0.12
• If we run this code,
fitted.results <- predict(model, total_test[, -11], type = 'response')
fitted.results_yes_no <- ifelse(fitted.results > 0.12, "yes", "no")
table(total_test$y, fitted.results_yes_no)
we get the confusion matrix below.
Overall accuracy = 82.5% and true positive rate = 60% (0.6).
19. Using naïve Bayes
• R code:
require(e1071)
naive_model <- naiveBayes(y ~ ., data = total_train[, -11], laplace = 0)
result = predict(naive_model, total_test[, -11])
table(total_test$y, result)
• Here we get the confusion matrix below.
• Accuracy = 83.76%, true positive rate = 53.5% (0.535).
• NOTE: it was observed that using Laplace smoothing decreases the true positive rate.
20. Using SVM with the default kernel
• R code:
require(e1071)
svmmod <- svm(y ~ ., data = total_train[, -11])
pred <- predict(svmmod, total_test[, c(-11, -21)], decision.values = TRUE)
table(total_test$y, pred)
This dataset has about 40k rows, and svm() will take a long time to fit a predictive model to it, but you can load an already saved SVM model and test your data on that.
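For completeness, such a model file would be produced (after one run of the svm() call above) with R’s standard serialization; this is a sketch, and ‘svm_model.rda’ is simply the file name used on the next slide:

```r
# after fitting svmmod once, save it so it can be reloaded later
# without repeating the long training run:
save(svmmod, file = "svm_model.rda")
```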
21. SVM
Steps to load the existing model and predict:
Store the ‘svm_model.rda’ file in your working directory and run this code,
load("svm_model.rda")
ls()  # check whether svmmod is loaded
pred <- predict(svmmod, total_test[, c(-11, -21)], decision.values = TRUE)
Make sure that you load all necessary libraries before running the predict() method.
22. SVM accuracy
• This is the confusion matrix we got using SVM.
• The overall accuracy is 89.41%, but the true positive rate is only 17.75% (0.177), which is very low compared to all the previous methods we saw.
23. Final verdict
• This dataset shows good results with the logistic regression and naïve Bayes methods.
• SVM gives good overall accuracy but fails in terms of the true positive rate.
• This dataset has lots of categorical attributes, which makes it a good candidate for classification with decision trees.
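A decision-tree baseline, as the verdict suggests, could be tried with the rpart package. This is a sketch only, assuming total_train and total_test from the earlier slides; rpart is our choice of tree package, not the deck’s:

```r
require(rpart)
# fit a classification tree, again excluding the duration column (11)
tree_model <- rpart(y ~ ., data = total_train[, -11], method = "class")
tree_pred  <- predict(tree_model, total_test[, -11], type = "class")
table(total_test$y, tree_pred)  # confusion matrix for comparison
```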