Practical Data Science
Assignment – 2
Report on Revenue Decline for
Portuguese Banking Institution
Authors:
Phalgun Haribabu Chintal, s3702107
Santhosh Kumaravel Sundaravadivelu, s3729461
Table of contents
1. Introduction
2. Methodology
2.1 Data Preparation
2.2 Data Exploration
2.3 Data Modelling
3. Results
4. Discussion
5. Conclusion
Abstract
The purpose of this report is to predict, for every client of a Portuguese banking institution
contacted through direct marketing campaigns, whether the client subscribes to a term deposit.
The institution is attempting to raise its subscriber base. The findings suggest that some
clients had problems keeping up their subscription to a term deposit. Overall, the result
depends strongly on the duration attribute, which affects the target variable. The report
concludes with a prediction of whether or not each client has the subscription.
1. Introduction
Term deposits earn interest once a set amount has been deposited with the bank. The bank
applies numerous rules and regulations to term deposits, which require that the money be kept
for a period of time agreed with the client. The Portuguese banking institution experienced an
unprecedented major decline in revenue and was seeking a solution to overcome it. Some clients
are recorded as declining the subscription with a duration of 0, that is, before the call was
even processed. On investigation, the central setback was that clients were not depositing
continuously. The idea behind term deposits is to achieve a financial gain by retaining the
amount for a specific time period, which results in profit. Furthermore, term deposits raise
the chances of clients taking up other products or insurance, which gives the institution a
foothold for increasing its revenue. On this basis, the institution is looking to bridge the
gap and overcome the problem. Since this is a classification problem, we have used the
K-Nearest Neighbors (KNN) and Decision Tree algorithms.
2. Methodology
2.1 Data Preparation
2.1.1 Loading packages and dataset:
By default, not all packages are loaded into the Jupyter notebook, so the packages required for
the tasks are imported first. The dataset ‘bank.csv’ is loaded into the notebook with the pandas
library, which provides data structures and data analysis tools for the Python language.
Because the columns in this file are separated by ';', ';' is passed as the separator
parameter. The dataset contains 41188 observations and 21 variables.
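The loading step described above can be sketched as follows. This is a minimal illustration, not the authors' notebook: the three-row sample and its column names are hypothetical stand-ins for 'bank.csv', which is not reproduced here.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for 'bank.csv'; the real file is not included.
sample_csv = io.StringIO(
    "age;job;duration;y\n"
    "56;housemaid;261;no\n"
    "57;services;149;no\n"
    "37;services;226;yes\n"
)

# sep=';' tells pandas that the columns are separated by semicolons, not commas.
bank = pd.read_csv(sample_csv, sep=";")

# For the full dataset the report states 41188 observations and 21 variables.
print(bank.shape)
```

With the real file, the first argument would be the path to 'bank.csv' instead of the in-memory sample.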
2.1.2 Setting the column names:
The variables in the dataset are given new names to remove ambiguity. A total of 21 variable
names are assigned through the columns of the data frame.
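A minimal sketch of the renaming step, assuming the names are assigned to the data frame's `columns` attribute. The raw and new names below are hypothetical examples; the report does not list the 21 names it used.

```python
import pandas as pd

# Toy frame with raw-style column names (hypothetical, for illustration only).
bank = pd.DataFrame(
    {"emp.var.rate": [1.1], "cons.price.idx": [93.994], "y": ["no"]}
)

# Assigning a list to .columns replaces every name at once;
# the list must contain exactly one name per existing column.
bank.columns = ["variation_rate", "price_index", "subscription"]
```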
2.1.3 Removal of whitespace
The observations in this dataset might contain stray whitespace. Reviewing them by hand is time
consuming, as there are 21 variables in the bank dataset, so a stripping function is used to
handle the whitespace in all variables. First, remove_whitespace is defined with an argument x,
which stands for each individual value. If the value is a string, its surrounding whitespace is
removed; otherwise, the value is kept as the original observation.
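The stripping function might look like the sketch below. The name remove_whitespace comes from the report; its body and the way it is applied to the frame are assumptions, and the two-row frame is hypothetical.

```python
import pandas as pd


def remove_whitespace(x):
    # Strip leading and trailing whitespace from strings; leave other values unchanged.
    return x.strip() if isinstance(x, str) else x


# Hypothetical frame with one padded string value.
bank = pd.DataFrame({"job": [" admin. ", "services"], "age": [30, 41]})

# Apply the function to every cell, column by column.
bank = bank.apply(lambda col: col.map(remove_whitespace))
```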
2.1.4 Converting string observations to lower case
The dataset contains many string values, which makes it difficult to review every observation.
Some values might be written in upper case, which would cause errors in later processing. The
recommended remedy is to convert all strings to lower case. First, remove_letter is defined
with an argument x, which stands for each individual value. If the value is a string, it is
converted to lower case; otherwise, it is kept as the original.
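A sketch of the lower-casing step under the same assumptions as the whitespace step: remove_letter is the name given in the report, while the body and the toy frame are illustrative.

```python
import pandas as pd


def remove_letter(x):
    # Convert strings to lower case; leave non-string values unchanged.
    return x.lower() if isinstance(x, str) else x


# Hypothetical frame mixing upper-case and lower-case spellings.
bank = pd.DataFrame({"marital": ["Married", "SINGLE", "single"], "age": [30, 41, 52]})

bank = bank.apply(lambda col: col.map(remove_letter))
```

After this step, values such as 'SINGLE' and 'single' count as the same category.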
2.1.5 Typographical errors:
Occasionally, a dataset contains multiple typographical errors. On careful inspection, this
dataset contains no typos.
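One way to carry out such an inspection is to list the distinct values of each categorical column and read them for misspellings. The report does not say how the check was done, so the following is only a sketch with hypothetical columns and values.

```python
import pandas as pd

# Hypothetical categorical columns.
bank = pd.DataFrame(
    {
        "marital": ["married", "single", "married", "divorced"],
        "contact": ["cellular", "telephone", "cellular", "cellular"],
    }
)

# Sorted distinct values per column; a typo such as 'maried' would show up as an extra level.
levels = {col: sorted(bank[col].unique()) for col in ["marital", "contact"]}

for col, values in levels.items():
    print(col, values)
```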
2.1.6 Dealing with the missing values:
The bank dataset contains various 'unknown' observations, which stand for missing values in
some categorical attributes. To deal with these missing observations, they are first converted
into NaN values. The ffill method is then applied, which forward fills all the NaN values.
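The two steps above can be sketched with pandas as follows. The single-column frame is a hypothetical example.

```python
import numpy as np
import pandas as pd

# Hypothetical column in which 'unknown' marks a missing value.
bank = pd.DataFrame({"job": ["admin.", "unknown", "services", "unknown"]})

# Step 1: turn the 'unknown' marker into NaN.
# Step 2: forward fill, so each NaN takes the last valid value above it.
bank = bank.replace("unknown", np.nan).ffill()
```

Forward filling cannot fill a NaN in the first row, because there is no earlier value to copy, so a final check with `isna()` is worthwhile.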
2.2 Data Exploration:
The box plot in fig.1 graphically describes groups of numerical data through their quartiles.
The minimum duration is 0, whereas 74 is given as the maximum. Any point not included within
the whiskers is represented as an outlier.
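As an illustration of how a box plot separates outliers, the sketch below computes the quartiles and the usual 1.5 × IQR fences. The duration values are hypothetical, and it is an assumption that the figure used this common convention.

```python
import numpy as np

# Hypothetical durations; not taken from the bank dataset.
durations = np.array([0, 10, 20, 30, 40, 50, 60, 70, 500])

# First and third quartiles and the interquartile range.
q1, q3 = np.percentile(durations, [25, 75])
iqr = q3 - q1

# Points outside these fences are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = durations[(durations < lower_fence) | (durations > upper_fence)]
```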
The bar chart in fig.2 shows the counts for the number of employees. The value 5228.1 has more
than 16000 counts. On the other hand, 5023.5 has the fewest, at below 2000.
Fig.3 is a density graph, which shows the distribution of a numeric variable. The density curve
can be read as a smoothed histogram. The number of days passed is the value of the variable,
while the density is the estimate. A value of about 1000 days passed has the highest density.
The area under the curve between any two values gives an estimate of the probability that the
variable falls between them.
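The link between area and probability can be checked numerically. The sketch below uses a normalised histogram as a coarse stand-in for the smooth density curve; the data are randomly generated and have nothing to do with the bank dataset.

```python
import numpy as np

# Hypothetical numeric variable drawn from a standard normal distribution.
rng = np.random.default_rng(0)
values = rng.normal(loc=0.0, scale=1.0, size=10_000)

# density=True scales the bar heights so that the total area equals 1.
density, edges = np.histogram(values, bins=50, density=True)
widths = np.diff(edges)
total_area = float(np.sum(density * widths))

# Area of the bars lying entirely between -1 and 1:
# an estimate of the probability of falling in that range.
inside = (edges[:-1] >= -1.0) & (edges[1:] <= 1.0)
prob_between = float(np.sum(density[inside] * widths[inside]))
```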
Fig.4 illustrates the proportion of the two contact types used by the Portuguese banking
institution. Most contacts were made by cellular, which comprised about 60%, leaving the
remaining portion to telephone.
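Proportions like these can be computed directly before plotting. A minimal sketch with a hypothetical five-value column:

```python
import pandas as pd

# Hypothetical contact column: three cellular contacts and two by telephone.
contact = pd.Series(["cellular", "cellular", "cellular", "telephone", "telephone"])

# normalize=True returns shares that sum to 1 instead of raw counts.
shares = contact.value_counts(normalize=True)
```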
The box plot in fig.5 shows the variation rate. A large portion of the cases have a value
greater than the median, and few have a lower value. There is one outlier, which means that
value does not fall within the inner fences.
The bar graph in fig.6 shows the previous contacts performed by the banking organization. It is
clear that 0 previous contacts was the largest group. Clients with 1 previous contact accounted
for just under 5000, followed by 2 with at least 1000 counts. Clients with 3 previous contacts
were only a few, which was the lowest figure in the chart.
Fig.7 shows the density curve of the three-month rate. The density peaks sharply at the value 0.
The pie chart in fig.8 shows the outcome recorded by the bank institution. The largest portion
is non-existent, ahead of the other types. Failure is the second most common result, followed
by success with the smallest share.
Fig.9 shows the density graph of price_index. It shows a peak density of 500.
The density curve in fig.10 shows a sharp peak between 0 and 500 for the campaign of the
banking institution.
Fig.11 illustrates the relation between duration and the target variable, subscription deposit.
Durations between 0 and 2000 are observed for clients whose term deposit was approved. On the
other hand, durations between 50 and 2200 are observed for clients whose term deposit was not
approved.
In fig.12, broken down by number of employees, those depositing make up a larger portion than
those not depositing.
The bar chart of Euribor in fig.13 shows roughly equal chances of acceptance and rejection of
the term deposit. When the Euribor is around 5, the count rises from 10000 to 14000 with a good
portion of term deposits, whereas the unsuccessful status stays at about 500. When the Euribor
is between 1 and 2, or near 4, a higher portion of term deposits is successful. When the
Euribor is near 1, the chances are equal for both subscription outcomes.
When the number of days passed is 999, the count of successful term deposits is about 35000,
compared with 4000 for declined term deposits. In contrast, when the days passed are between 0
and 20, the count of rejections is just above that of successful deposits, as shown in fig.14.
In fig.15, across the variation rate values, subscriptions outnumber failed term deposits. A
variation rate of 1 reaches 15000 subscriptions, while -1 stays at the bottom.
The bar chart in fig.16 provides information about the price index of the bank institution. The
subscription count at a price index of 94.0 was 14000, higher than the rest of the index values
by a very large margin. For a price index above 94.5, the counts are lower for both
subscription outcomes.
Fig.17 is the bar graph of campaign against subscription. The range from 0 to 5 stands out,
with 25000 subscriptions. However, fewer than 2500 had a subscription between 5 and 18, and
fewer still did not subscribe.
Fig.18 shows the total number of subscriptions by the previous contacts performed. Clients with
0 previous contacts were by far the largest group, with more than 35000 subscriptions, whereas
2, 3 and 4 had equal chances of subscription.
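The counts behind grouped bar charts such as fig.11 to fig.18 can be tabulated with a cross-tabulation of a feature against the target. The sketch below uses hypothetical column names and six made-up rows.

```python
import pandas as pd

# Hypothetical rows: number of previous contacts and the subscription outcome.
bank = pd.DataFrame(
    {
        "previous": [0, 0, 0, 1, 1, 2],
        "subscription": ["no", "no", "yes", "no", "yes", "yes"],
    }
)

# Rows are feature values, columns are target classes, cells are counts.
counts = pd.crosstab(bank["previous"], bank["subscription"])
print(counts)
```

Calling `counts.plot(kind="bar")` would draw the corresponding grouped bar chart when matplotlib is installed.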
2.3 Data Modelling
This is the procedure for building the model that predicts which clients are expected to
subscribe to a term deposit. The target variable is binary, with the observations 'yes' and
'no'. This makes it a classification problem, in which the data is classified with the help of
the class label. When the data was examined, multiple categorical variables were found. In
order to fit them in the model, the categorical variables are converted into numeric variables.
On further processing, many variables were seen to have missing data, so they were removed. The
duration, on the other hand, is included because of its high correlation with whether a client
subscribes. Random Forest is used for feature selection, with the F1 score used as the
selection criterion. A pipeline links Random Forest and KNN so that the best features are
selected and used together. K-Nearest Neighbors (KNN) and Decision Tree are the two models
fitted, to compare their performance in predicting whether or not a client subscribes to a term
deposit. The data is split into test and train sets in the ratios 20% : 80%, 40% : 60%, and
50% : 50%.
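The modelling setup described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the authors' code: the data are synthetic, the hyperparameters (50 trees, 5 neighbours, depth 5) are arbitrary placeholders, and `SelectFromModel` is one possible way to let a Random Forest choose features ahead of KNN.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded bank data (all features already numeric).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=0
)

accuracies = {}
for test_size in (0.2, 0.4, 0.5):  # the three test : train ratios in the report
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=0
    )

    # Random Forest ranks the features; KNN is then fitted on the selected ones.
    knn = Pipeline(
        [
            ("select", SelectFromModel(
                RandomForestClassifier(n_estimators=50, random_state=0)
            )),
            ("knn", KNeighborsClassifier(n_neighbors=5)),
        ]
    )
    tree = DecisionTreeClassifier(max_depth=5, random_state=0)

    knn.fit(X_train, y_train)
    tree.fit(X_train, y_train)

    # Accuracy on the held-out test set for each model.
    accuracies[test_size] = (knn.score(X_test, y_test), tree.score(X_test, y_test))
```

The number of neighbours and the tree depth could be tuned with `GridSearchCV` on the training data; the scores from this synthetic example are not comparable with those in the report.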
3. Results
Results obtained after applying both models on the 3 splits are as follows. For the K-NN model:

TEST TO TRAIN RATIO   ACCURACY   CLASSIFICATION ERROR
20% : 80%             0.91247    0.087521
40% : 60%             0.91090    0.089099
50% : 50%             0.90905    0.09094
Results for the Decision Tree, using the best score on the 3 splits, are as follows:

TEST TO TRAIN RATIO   ACCURACY   CLASSIFICATION ERROR
20% : 80%             0.918062   0.081938
40% : 60%             0.916788   0.083212
50% : 50%             0.914538   0.085462
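In these tables the classification error is the complement of the accuracy, error = 1 - accuracy. A short check using the Decision Tree accuracies reported above:

```python
# Decision Tree accuracies as reported for the three test : train splits.
tree_accuracy = {"20:80": 0.918062, "40:60": 0.916788, "50:50": 0.914538}

# Classification error is the share of misclassified observations.
tree_error = {split: round(1 - acc, 6) for split, acc in tree_accuracy.items()}

for split, err in tree_error.items():
    print(split, err)
```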
Classification report for K-NN Model :

TEST TO TRAIN RATIO   ACCURACY   BEST SCORE   INSTANCE   PRECISION   RECALL   F1-SCORE
20% : 80%             0.91247    0.90725      0          0.93        0.97     0.95
                                              1          0.68        0.42     0.52
40% : 60%             0.91090    0.90830      0          0.93        0.97     0.95
                                              1          0.66        0.44     0.53
50% : 50%             0.90905    0.90909      0          0.94        0.96     0.95
                                              1          0.62        0.50     0.55
Classification report for Decision Tree Model :

TEST TO TRAIN RATIO   ACCURACY   BEST SCORE   INSTANCE   PRECISION   RECALL   F1-SCORE
20% : 80%             0.918062   0.91402      0          0.93        0.98     0.95
                                              1          0.71        0.46     0.56
40% : 60%             0.916788   0.91299      0          0.95        0.96     0.95
                                              1          0.65        0.56     0.60
50% : 50%             0.914538   0.91264      0          0.94        0.96     0.95
                                              1          0.65        0.53     0.58
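The F1-score in the classification reports is the harmonic mean of precision and recall, F1 = 2 × precision × recall / (precision + recall). The sketch below recomputes it for class 1 on the 20% : 80% split of each model; the helper name f1_from is hypothetical.

```python
def f1_from(precision, recall):
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Class 1, 20% : 80% split, precision and recall as reported above.
f1_knn = round(f1_from(0.68, 0.42), 2)
f1_tree = round(f1_from(0.71, 0.46), 2)

print(f1_knn, f1_tree)
```

Both values agree with the reported 0.52 and 0.56. The low recall for class 1 is what pulls the F1-score well below the overall accuracy.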
4. Discussion
The goal of the prediction was to determine ways to get clients to subscribe to the term
deposit. Both models performed well under the different splits, 80:20, 60:40 and 50:50. The KNN
model performed well because the pipeline was used along with the plain classifier to obtain
good results. Random Forest was able to filter the best features, and the right number of
neighbors was found for each split. The Decision Tree was more straightforward in producing
results than the KNN model. Different depths were explored before selecting the depth that gave
good results.
A few limitations were observed. The results may be biased because there is an imbalance in the
target variable, which in turn may affect the overall result. This can be dealt with by
undersampling or oversampling, which could be performed in the near future to get more accurate
results.
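As a sketch of the oversampling idea mentioned above, the minority class can be resampled with replacement until it matches the majority class. The rows and column names are hypothetical, and this random oversampling is only one of several possible methods.

```python
import pandas as pd

# Hypothetical imbalanced data: six 'no' rows and two 'yes' rows.
bank = pd.DataFrame(
    {
        "duration": [100, 200, 150, 900, 120, 80, 700, 60],
        "subscription": ["no", "no", "no", "yes", "no", "no", "yes", "no"],
    }
)

majority_size = bank["subscription"].value_counts().max()

parts = []
for label, group in bank.groupby("subscription"):
    if len(group) < majority_size:
        # Draw rows with replacement until the class reaches the majority size.
        group = group.sample(n=majority_size, replace=True, random_state=0)
    parts.append(group)

balanced = pd.concat(parts, ignore_index=True)
```

Resampling should be applied to the training split only, so that duplicated rows do not leak into the test set.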
5. Conclusion
The objective of this investigation was to discover which attributes determine whether or not a
client takes a term deposit. In this study, a different number of features was selected under
each split when predicting the uptake of a term deposit. The duration and the previous contacts
performed play the main role, whereas the rest had the smallest influence on the decision. The
higher these two attributes are, the higher the chances of a term deposit subscription. The
bank can focus on these high-impact variables to target clients for a term deposit. To sum up,
the Decision Tree gives a higher score than the K-NN model, so the Decision Tree is the better
of the two according to the results obtained.
References
Archive.ics.uci.edu. (2019). UCI Machine Learning Repository: Bank Marketing Data Set. [online]
Available at: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing [Accessed 20 May 2019].
En.wikipedia.org. (2019). Box plot. [online] Available at: https://en.wikipedia.org/wiki/Box_plot
[Accessed 24 May 2019].

Classification Problem with KNN

  • 1. P a g e 1 | 10 Practical Data Science Assignment – 2 Report on Revenue Decline for Portuguese Banking Institution Authors: Phalgun Haribabu Chintal, s3702107 Santhosh Kumaravel Sundaravadivelu, s3729461
  • 2. P a g e 2 | 10 Table of contents 1. Introduction 2. Methodology 2.1 Data Preparation 2.2 Data Exploration 2.3 Data Modelling 3. Results 4. Discussion 5. Conclusion
  • 3. P a g e 3 | 10 Abstract The purpose of this report was to predict the subscription deposit status of every client of a Portuguese banking institution through direct marketing campaigns. A Portuguese banking institution attempts to raise its subscriber base. The findings suggest that Some clients faced problems coping with their subscription of a term deposit. Overall, the result clearly depends on the duration attribute which affects the target variable. The report concludes that if the client has the subscription or not. 1. Introduction Term deposits in bank operate has to gain interest when the set of amount has been deposited. The bank has numerous rules and regulations of term deposit which symbolizes that money should be kept for some period of time that the client agrees. Portuguese bank organization experienced a major decline in revenue unexampled and was seeking for a solution to overcome this drawback. Some clients faced declination of subscription in bank institution as the duration is 0 even before the call is processed. When investigated, the central setback was that their clients were not depositing the amount continuously. The idea that lies with term deposits is to set for a financial gain by retaining the amount for a specific time period as it emerges in profit. Furthermore, it also boosts the clients' chances of taking up products or insurances which gives a footprint to increase their revenues. As per these calculations, the institution is building a gap to overcome the problem. Since it is a classification problem we have dealt with KNN and Decision Algorithms. 2. Methodology 2.1 Data Preparation 2.1.1 Loading packages anddataset: By default, not all packages are loaded into the jupyter notebook. Invoke all the necessary packages required to perform the tasks. The dataset ‘bank.csv’ is loaded into the notebook with the help of the pandas library because it is accessible to handle data structures and data analysis for the python language. 
';' is used in separator parameter as the columns in this dataset were separated by ';'. The dataset here contains 41188 observations along with 21 variables. 2.1.2 Setting the names of the column: The variable in the dataset is replaced with a new name to withdraw ambiguity. A total of 21 variable names are interpolated with the column function. 2.1.3 Removalofwhitespace
  • 4. P a g e 4 | 10 The observation in this dataset might contain whitespace. It is time haunting to review for whitespace as there are 21 variables present in the bank dataset. Striping function is used to handle all the whitespace present in variables. In the beginning, remove_whitespace is defined along with x, which stands for every bit of variable. If a base string holds whitespace, they are deleted, or else, the string remains as an original observation. 2.1.4 Replacing the string observations to lowerstring The dataset carries a pack of string values, which leads in difficulty to review all the observations. Some values might signify in uppercase, which results in an error when processed further. The genuine recommendation for it can be performed by replacing all the string to lower case string. Originally, start by defining remove_letter coupling with x, which stands of whole variables. If there is a base string with upper case, they are transformed into lower case. Or else, the string is held as an original. 2.1.5 Typo errors: Unusually, there might be manifold typological errors exist. From the clear observation in this dataset, there is no typos error. 2.1.6 Dealing with the missing values: The bank dataset holds various unknown observations that persist for missing values in some categorical attributes. With the aim of dealing with these missing observations, they are first converted into NaN values. Following with ffill method that processes with forward filling all the NaN values. 2.2 Data Exploration:
The box plot in fig.1 graphically describes groups of numerical data through their quartiles. The minimum duration is 0, while 74 is the maximum. Any point outside this range is plotted as an outlier.
The bar chart in fig.2 shows the counts for the number of employees. The value 5228.1 has above 16000 counts, while 5023.5 has the fewest, below 2000.
Fig.3 is a density graph, which shows the distribution of a numeric variable. The density curve is a smoothed histogram. The number of days passed is the value of the variable, and the density is the estimate. The highest density is at about 1000 days passed. The area under the curve between two values estimates the probability of falling between them.
Fig.4 illustrates the proportions of the two contact types used by the Portuguese banking institution. Cellular contact made up about 60%, with telephone accounting for the rest.
The box plot in fig.5 shows the variation rate. A large portion of the cases have a value greater than the median, and few have a lower value. There is one outlier, meaning a value that falls outside the inner fences.
The bar graph in fig.6 shows the number of previous contacts performed by the banking organization. A count of 0 was clearly the largest category. A count of 1 accounted for just under 5000 cases, followed
by 2 with at least 1000 counts. A count of 3 covered only a few cases, the lowest figure in the chart.
Fig.7 shows the density curve of the three-month rate. The density peaks at the value 0.
The pie chart in fig.8 shows the outcome attribute. Non-existent has the largest share, failure is second, and success has the smallest share.
Fig.9 shows the density graph of price_index. The density peaks at 500.
The density curve in fig.10 shows a peak between 0 and 500 for the campaign attribute.
Fig.11 illustrates the relation between duration and the target variable, subscription deposit. Durations between 0 and 2000 were associated with approval of the term deposit, while durations between 50 and 2200 were associated with disapproval.
In fig.12, for the number of employees, depositing has a larger share than not depositing.
The bar chart in fig.13 for Euribor indicates comparable chances of acceptance and rejection of a term deposit. When the Euribor is around 5, the term deposit count rises from 10000 to 14000, whereas the unsuccessful count stays at 500. When the Euribor is between 1 and 2, or near 4, successful term deposits have the larger share. When the Euribor is near 1, the chances are equal for both subscription outcomes.
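The per-feature comparisons against the target described for fig.11 to fig.13 can be expressed as counts with a cross-tabulation. A minimal sketch, assuming pandas; the column names euribor_band and subscribed and the sample rows are hypothetical, not taken from the report's notebook:

```python
import pandas as pd

# Hypothetical sample rows; the real data would come from bank.csv.
df = pd.DataFrame({
    "euribor_band": ["~1", "~1", "~5", "~5", "~5", "~4"],
    "subscribed":   ["yes", "no", "no", "no", "yes", "yes"],
})

# Count how often each target value occurs within each feature value.
counts = pd.crosstab(df["euribor_band"], df["subscribed"])
print(counts)

# Row-wise shares make bands of different sizes comparable.
shares = pd.crosstab(df["euribor_band"], df["subscribed"], normalize="index")
print(shares.round(2))
```

Reading the shares alongside the raw counts helps when one band holds far more clients than another, as the bar charts suggest.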
When the number of days passed is 999, successful term deposits number about 35000, against 4000 declined. In contrast, when the days passed is between 0 and 20, rejections are just above successful deposits, as shown in fig.14.
In fig.15, the variation rate values show more subscriptions than failed term deposits. A variation rate of 1 reaches 15000 subscriptions, while -1 stays at the bottom.
The bar chart in fig.16 shows the price index. At a price index of 94.0 the subscription count was 14000, higher than at any other index value by a very large margin. Above 94.5, counts are lower for both subscription outcomes.
Fig.17 is the bar graph of campaign against subscription. The range from 0 to 5 stands out, with 25000 subscriptions. Between 5 and 18, fewer than 2500 subscribed, and fewer still did not subscribe.
Fig.18 shows the total number of subscriptions by the number of previous contacts performed. A count of 0 was the highest, with more than 35000 subscriptions, whereas 2, 3 and 4 had equal chances of subscription.
2.3 Data Modeling
This is the procedure for building the model that predicts which clients are expected to subscribe to a term deposit. The target variable is binary, with the values 'yes' and 'no'. This is a classification task, in which observations are classified using the class label. On examination, the data contained multiple categorical variables. To fit them in the model, they were converted to numeric variables. Further processing showed that many variables had missing data, so they were removed. On the other
hand, duration is included because of its high correlation with whether a client subscribes. Random Forest is used for feature selection, with the F1 score as the selection measure. A pipeline links KNN and Random Forest so that the best features are selected together with the model. K-Nearest Neighbors (KNN) and Decision Tree are the two models fitted and compared on how well they predict whether clients subscribe to a term deposit. The data is split into test and train sets in the ratios 20% : 80%, 40% : 60%, and 50% : 50%.
3. Results
Results of the K-NN model on the 3 splits are as follows:
TEST TO TRAIN RATIO  ACCURACY  CLASSIFICATION ERROR
20% : 80%            0.91247   0.087521
40% : 60%            0.91090   0.089099
50% : 50%            0.90905   0.09094
Results of the Decision Tree for the best score on the 3 splits are as follows:
TEST TO TRAIN RATIO  ACCURACY  CLASSIFICATION ERROR
20% : 80%            0.918062  0.081938
40% : 60%            0.916788  0.083212
50% : 50%            0.914538  0.085462
Classification report for K-NN Model:
TEST TO TRAIN RATIO  ACCURACY  BEST SCORE  INSTANCE  PRECISION  RECALL  F1-SCORE
20% : 80%            0.91247   0.90725     0         0.93       0.97    0.95
                                           1         0.68       0.42    0.52
40% : 60%            0.91090   0.90830     0         0.93       0.97    0.95
                                           1         0.66       0.44    0.53
50% : 50%            0.90905   0.90909     0         0.94       0.96    0.95
                                           1         0.62       0.50    0.55
Classification report for Decision Tree Model:
TEST TO TRAIN RATIO  ACCURACY  BEST SCORE  INSTANCE  PRECISION  RECALL  F1-SCORE
20% : 80%            0.918062  0.91402     0         0.93       0.98    0.95
                                           1         0.71       0.46    0.56
40% : 60%            0.916788  0.91299     0         0.95       0.96    0.95
                                           1         0.65       0.56    0.60
50% : 50%            0.914538  0.91264     0         0.94       0.96    0.95
                                           1         0.65       0.53    0.58
4. Discussion
The aim of the prediction was to find ways to get clients to subscribe to a term deposit. Both models performed well across the 80:20, 60:40 and 50:50 splits. The KNN model performed well because a pipeline was used alongside the plain classifier. Random Forest filtered the best features, and the right number of neighbors was selected for each split. The Decision Tree was more straightforward than the KNN model: different depths were explored before selecting the depth that gave good results. A few limitations were observed. The results may be biased because the target variable is imbalanced, which may affect the overall result. This can be addressed with undersampling or oversampling, which could be done in the near future to get more accurate results.
5. Conclusion
The objective of this investigation was to discover which attributes determine whether a client takes a term deposit. In this study, different numbers of features were found to be decisive under different splits, while the remaining features had the smallest influence on the decision. Duration and the number of previous contacts play the main role: the higher these attributes, the higher the chances of a term deposit subscription. The bank can focus on these high-impact variables when targeting clients for a term deposit. To sum
up, the Decision Tree scores higher than the K-NN model, so the Decision Tree is the better model according to the results obtained.
References
Archive.ics.uci.edu. (2019). UCI Machine Learning Repository: Bank Marketing Data Set. [online] Available at: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing [Accessed 20 May 2019].
En.wikipedia.org. (2019). Box plot. [online] Available at: https://en.wikipedia.org/wiki/Box_plot [Accessed 24 May 2019].