This document is a report analyzing revenue decline at a Portuguese banking institution. It uses data on 41188 clients to predict subscription to term deposits through machine learning models. The methodology section describes data preparation including handling missing values and outliers. Various data exploration visualizations are presented. KNN and decision tree models are applied at different train-test splits, with decision tree achieving slightly better accuracy scores. The duration attribute is found to most influence subscription. The report concludes decision trees perform better than KNN for this prediction problem.
Classification Problem with KNN
Practical Data Science
Assignment – 2
Report on Revenue Decline for
Portuguese Banking Institution
Authors:
Phalgun Haribabu Chintal, s3702107
Santhosh Kumaravel Sundaravadivelu, s3729461
Table of contents
1. Introduction
2. Methodology
2.1 Data Preparation
2.2 Data Exploration
2.3 Data Modelling
3. Results
4. Discussion
5. Conclusion
Abstract
The purpose of this report is to predict whether each client of a Portuguese banking institution will subscribe to a term deposit, using data collected from direct marketing campaigns. The institution is attempting to grow its subscriber base. The findings show that some clients had difficulty taking up a term-deposit subscription and that, overall, the outcome depends most clearly on the duration attribute, which strongly affects the target variable. The report concludes by predicting whether or not each client holds the subscription.
1. Introduction
A term deposit earns interest once a set amount has been deposited with the bank. The bank has numerous rules and regulations for term deposits, chiefly that the money must be kept for a period of time that the client agrees to. The Portuguese banking organization experienced an unprecedented major decline in revenue and was seeking a solution to overcome this drawback. Some clients declined the subscription outright; for these records the duration is 0, as the outcome was decided even before the call was processed. When investigated, the central setback was that clients were not depositing money consistently. The idea behind term deposits is that the bank secures a financial gain by retaining the amount for a specific time period, which yields a profit. Furthermore, term deposits boost the chances of clients taking up other products or insurance, which gives the institution a further route to increasing revenue. For these reasons, the institution is working to close this gap and overcome the problem. Since this is a classification problem, we have used K-Nearest Neighbors (KNN) and decision tree algorithms.
2. Methodology
2.1 Data Preparation
2.1.1 Loading packages and dataset:
By default, not all packages are available in the Jupyter notebook, so all the packages required to perform the tasks are imported first. The dataset 'bank.csv' is loaded into the notebook using the pandas library, which provides convenient data structures and data-analysis tools for Python. ';' is used as the separator parameter because the columns in this dataset are separated by ';'. The dataset contains 41188 observations and 21 variables.
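As an illustrative sketch of this loading step (the file path is an assumption):

    import pandas as pd

    # Load the semicolon-delimited bank marketing data into a DataFrame.
    bank = pd.read_csv("bank.csv", sep=";")

    print(bank.shape)  # expected to show (41188, 21)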
2.1.2 Setting the column names:
The variables in the dataset are given new names to remove ambiguity. All 21 variable names are assigned through the DataFrame's columns attribute.
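A sketch of the renaming step; the 21 names below are illustrative (based on the public UCI bank marketing attributes and the terms used later in this report) rather than the exact names chosen by the authors:

    # Illustrative column names; the report does not list the exact names used.
    bank.columns = [
        "age", "job", "marital", "education", "default", "housing", "loan",
        "contact", "month", "day_of_week", "duration", "campaign", "pdays",
        "previous", "poutcome", "variation_rate", "price_index",
        "confidence_index", "euribor", "num_employees", "subscription",
    ]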
2.1.3 Removal of whitespace
Observations in this dataset may contain stray whitespace. With 21 variables in the bank dataset, it is time consuming to check each one for whitespace manually, so a stripping function is applied across the variables. A helper remove_whitespace is defined with argument x, which stands for each value; if the value is a string containing surrounding whitespace, the whitespace is removed, otherwise the value is kept as the original observation.
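A minimal sketch of the remove_whitespace helper described above:

    def remove_whitespace(x):
        # Strip leading/trailing whitespace from string values; leave others unchanged.
        if isinstance(x, str):
            return x.strip()
        return x

    bank = bank.applymap(remove_whitespace)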
2.1.4 Converting string observations to lower case
The dataset carries a large number of string values, which makes it difficult to review all the observations. Some values may appear in upper case, which can cause errors when processed further. The recommended fix is to convert all strings to lower case. A helper remove_letter is defined with argument x, which stands for each value; if the value is a string containing upper-case characters, it is converted to lower case, otherwise the value is kept as the original.
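A corresponding sketch of the remove_letter helper for lower-casing strings:

    def remove_letter(x):
        # Convert string values to lower case; leave non-strings unchanged.
        if isinstance(x, str):
            return x.lower()
        return x

    bank = bank.applymap(remove_letter)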
2.1.5 Typo errors:
Occasionally, a dataset contains multiple typographical errors. From a close inspection of this dataset, no typographical errors were found.
2.1.6 Dealing with the missing values:
The bank dataset holds various 'unknown' observations, which stand in for missing values in some categorical attributes. To deal with these missing observations, they are first converted into NaN values, and the ffill method is then applied to forward-fill all the NaN values.
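A sketch of this missing-value handling, assuming the categorical missing values are literally coded as the string 'unknown':

    import numpy as np

    # Treat 'unknown' entries as missing, then forward-fill them.
    bank = bank.replace("unknown", np.nan)
    bank = bank.ffill()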
2.2 Data Exploration:
The box plot in fig.1 graphically describes groups of numerical data through their quartiles. The minimum duration is 0, while the upper end of the main body of the data is 74; anything outside this range is displayed as an outlier.
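For illustration only, a box plot like fig.1 could be produced with pandas and matplotlib (the column name is an assumption):

    import matplotlib.pyplot as plt

    # Box plot of call duration; values beyond the whiskers are drawn as outliers.
    bank.boxplot(column="duration")
    plt.show()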
The bar chart in fig.2 shows the counts of the number-of-employees variable. The value 5228.1 has the highest count, above 16000, while 5023.5 has the lowest, below 2000.

Fig.3 is a density plot showing the distribution of a numeric variable; the density curve is a smoothed histogram. The variable is the number of days passed, and the y-axis is the estimated density. Around 1000 days passed has the highest probability, and the area under the curve between two values gives an estimate of the probability of falling between them.
Fig.4 illustrates the proportion of the two contact types used by the Portuguese banking institution. Cellular contact accounts for about 60% of contacts, with telephone making up the remaining portion.

The box plot in fig.5 shows the variation rate of the institution. A large portion of the cases have a value greater than the median, and few have a lower value. There is one outlier, meaning that one value falls outside the inner fences.

The bar graph in fig.6 shows the number of previous contacts made by the banking organization. Clearly 0 previous contacts is by far the largest group. One previous contact accounts for just under 5000 cases, followed by 2 with at least 1000 counts, while 3 previous contacts accounts for only a few cases, the lowest figure in the chart.
The three-month rate in fig.7 is shown as a density curve. There is a peak rise in density at the value 0.

The pie chart in fig.8 shows the outcome variable for the institution. The 'nonexistent' outcome makes up the largest portion, failure is the second most common result, and success plays the smallest role.

Fig.9 shows the density plot of price_index; the curve reaches a peak density of about 500.

The density curve in fig.10 for the campaign variable of the banking institution shows a peak rise between 0 and 500.
Fig.11 illustrates the relationship between duration and the target variable, the subscription deposit. Durations ranging from 0 to 2000 are associated with approval of the term deposit, whereas durations between 50 and 2200 are associated with disapproval of the term deposit.

In fig.12, for the number-of-employees variable, the portion of clients who deposit is larger than the portion of those who do not.

The bar chart of the Euribor rate in fig.13 compares the chances of acceptance and rejection of the term deposit. When the Euribor is around 5, the count rises from 10000 to 14000 and a good portion take up the term deposit, while the unsuccessful outcome stays around 500. When the Euribor is between 1 and 2, and near 4, a higher portion of successful term deposits is observed. When the Euribor is near 1, the chances are equal for both subscription outcomes.
As shown in fig.14, when the number of days passed is 999, successful term deposits number about 35000, compared with 4000 declined. In contrast, when the days passed is between 0 and 20, the chance of rejection is just above that of a successful deposit.

In fig.15, the variation-rate values show more subscriptions than failed term deposits. A variation rate of 1 extends to about 15000 subscriptions, while -1 sits at the bottom.

The bar chart in fig.16 provides information about the price index of the bank institution. At a price index of 94.0 the subscription count was about 14000, higher than the rest of the index values by a very large margin. For price index values above 94.5, counts are lower for both subscription outcomes.

Fig.17 is the bar graph of campaign against subscription. Campaign values from 0 to 5 were the most significant, accounting for about 25000 subscriptions, while campaign values between 5 and 18 accounted for fewer than 2500 subscriptions, and fewer still did not subscribe.

Fig.18 shows the total number of subscriptions by the number of previous contacts. Zero previous contacts is fairly high with more than 35000 subscriptions, whereas 2, 3 and 4 previous contacts show roughly equal chances of subscription.
2.3 Data Modeling
This section describes the procedure for building the model that determines which clients are expected to subscribe to a term deposit. The target variable has binary observations, 'yes' and 'no', so this is a classification problem in which the data is classified with the help of the class label. Once the data was examined, multiple categorical variables were discovered; in order to fit them in the model, the categorical variables are converted into numeric variables. On further processing, it was seen that several variables still had missing data in them, so they were removed. The duration variable, on the other hand, is included because of its high correlation with whether a client takes up a subscription with the bank. Random Forest is used for feature selection, with the F1 score as the selection criterion, and a pipeline is built that links the Random Forest selector and KNN so that the best features are used together. K-Nearest Neighbors (KNN) and a decision tree are the two different models fitted to determine their performance in predicting whether a client subscribes to a term deposit or not. The data is split into test and train sets at ratios of 20% : 80%, 40% : 60%, and 50% : 50% respectively, as sketched below.
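A minimal sketch of this modelling step, assuming the cleaned DataFrame is called bank, the target column is named subscription, and illustrative hyperparameter grids (the exact settings used in the report are not listed):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.tree import DecisionTreeClassifier

    # One-hot encode the categorical predictors and binarise the target.
    X = pd.get_dummies(bank.drop(columns=["subscription"]))
    y = (bank["subscription"] == "yes").astype(int)

    for test_size in (0.2, 0.4, 0.5):  # the three test:train ratios used in the report
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=0, stratify=y)

        # Pipeline: Random Forest selects the most important features, KNN classifies.
        knn_pipe = Pipeline([
            ("select", SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))),
            ("knn", KNeighborsClassifier()),
        ])
        knn_search = GridSearchCV(knn_pipe, {"knn__n_neighbors": [3, 5, 7, 9]},
                                  scoring="f1", cv=5)
        knn_search.fit(X_train, y_train)

        # Decision tree tuned over different depths, as mentioned in the discussion.
        tree_search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                                   {"max_depth": [3, 5, 7, 10]}, scoring="f1", cv=5)
        tree_search.fit(X_train, y_train)

        # Accuracy of the refitted best estimators on the held-out test split.
        print(test_size,
              knn_search.best_estimator_.score(X_test, y_test),
              tree_search.best_estimator_.score(X_test, y_test))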
3. Results
Results obtained after applying both models to the three splits are given below. For the K-NN model:
TEST TO TRAIN RATIO ACCURACY CLASSIFICATION ERROR
20% : 80% 0.91247 0.087521
40% : 60% 0.91090 0.089099
50% : 50% 0.90905 0.09094
Results for the Decision Tree model (best score) on the three splits are as follows:
TEST TO TRAIN RATIO ACCURACY CLASSIFICATION ERROR
20% : 80% 0.918062 0.081938
40% : 60% 0.916788 0.083212
50% : 50% 0.914538 0.085462
Classification report for K-NN Model:

TEST TO TRAIN RATIO  ACCURACY  BEST SCORE  INSTANCE  PRECISION  RECALL  F1-SCORE
20% : 80%  0.91247  0.90725  0  0.93  0.97  0.95
20% : 80%  0.91247  0.90725  1  0.68  0.42  0.52
40% : 60%  0.91090  0.90830  0  0.93  0.97  0.95
40% : 60%  0.91090  0.90830  1  0.66  0.44  0.53
50% : 50%  0.90905  0.90909  0  0.94  0.96  0.95
50% : 50%  0.90905  0.90909  1  0.62  0.50  0.55
Classification report for Decision Tree Model:

TEST TO TRAIN RATIO  ACCURACY  BEST SCORE  INSTANCE  PRECISION  RECALL  F1-SCORE
20% : 80%  0.918062  0.91402  0  0.93  0.98  0.95
20% : 80%  0.918062  0.91402  1  0.71  0.46  0.56
40% : 60%  0.916788  0.91299  0  0.95  0.96  0.95
40% : 60%  0.916788  0.91299  1  0.65  0.56  0.60
50% : 50%  0.914538  0.91264  0  0.94  0.96  0.95
50% : 50%  0.914538  0.91264  1  0.65  0.53  0.58
4. Discussion
The aim of the prediction was to determine what leads a client to subscribe to the term deposit, and both models performed well under the different circumstances, namely the 80:20, 60:40 and 50:50 splits. The KNN model performed well because a pipeline was used along with the plain classifier in order to get good results: Random Forest was able to filter out the best features, and a suitable number of neighbors was found for each split. The Decision Tree was more straightforward to apply than the KNN model; different depths were explored before selecting the right depth to get good results.
A few limitations were observed. The results may be biased because there is an imbalance in the target variable, which in turn may affect the overall result. This could be addressed with an undersampling or oversampling method in future work to get more reliable results; one possible approach is sketched below.
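As an illustration only (not a step performed in the report), the minority class could be oversampled before training, assuming the X_train and y_train objects from the modelling sketch above:

    import pandas as pd
    from sklearn.utils import resample

    # Put features and target side by side so rows can be resampled by class.
    train = pd.concat([X_train, y_train.rename("subscription")], axis=1)
    majority = train[train["subscription"] == 0]
    minority = train[train["subscription"] == 1]

    # Oversample the minority ('yes') class up to the size of the majority class.
    minority_up = resample(minority, replace=True,
                           n_samples=len(majority), random_state=0)
    balanced = pd.concat([majority, minority_up])

    X_bal = balanced.drop(columns=["subscription"])
    y_bal = balanced["subscription"]
    # The KNN pipeline or the decision tree can then be refitted on X_bal, y_bal.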
5. Conclusion
The objective of this investigation was to discover which attributes determine whether a client takes out a term deposit or not. In this study, different numbers of features were selected under the different circumstances when predicting the term-deposit outcome, while the remaining attributes had the smallest influence on the decision. The duration of the call and the number of previous contacts play the main role: the higher these attributes, the higher the chances of a term-deposit subscription. The bank can focus on these influential variables to target clients for a term deposit. To sum up, the Decision Tree achieves higher scores than the K-NN model, so the Decision Tree is the better model according to the results obtained.
References
Archive.ics.uci.edu (2019). UCI Machine Learning Repository: Bank Marketing Data Set. [online] Available at: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing [Accessed 20 May 2019].

En.wikipedia.org (2019). Box plot. [online] Available at: https://en.wikipedia.org/wiki/Box_plot [Accessed 24 May 2019].