Portuguese Bank Marketing Campaign
By
Eric Esajian, Logan Liang, Shuo Wang, Qian Zhang, Stephanus Gunawan
Executive Summary
The objective of this project is to help a Portuguese retail bank increase the success of its
telemarketing effort to sell long-term bank deposits. The bank needs to increase its reserves to
satisfy regulatory requirements and to increase its revenue, and this telemarketing effort will
help it reach that objective. Our group uses data mining techniques, including Decision Tree,
Logistic Regression and Neural Network, to help improve profit from the marketing campaign.
Our data set comes from a previous telemarketing campaign and ranges from customer information
to information about previous calls. The data also contain some external social and economic
context attributes, which could help us further improve the models. Profit and cost information
cannot be obtained from the data set, so we make some assumptions about cost and profit to
calculate the total profit gained by using our model.
After cleaning the data and making it usable in JMP, our first step was to create a benchmark
model. The benchmark model is a logistic model that uses only internal variables from the
previous marketing campaign (no external variables). Because the benchmark model takes the
duration variable into account, it cannot be used as a realistic prediction model, but it gives
us a benchmark against which to compare our later forecast models.
The techniques we use for the forecast models are Decision Tree, Logistic Regression and
Neural Network, and we build one forecast model with each technique. We evaluate the models
with statistical metrics as well as the profit calculated under our earlier assumptions. We
gained some insight from those models, which is interpreted in detail in the report. After
building the three models, we combine them by applying a regression model; the method is
explained in that part of the report. The metrics and the total profit gained show that our
best model does in fact give better results.
Background
A Portuguese retail bank is looking for a way to predict the success of telemarketing calls
that sell long-term bank deposits, i.e. CDs, savings accounts, etc. To that end, the bank
collected historical data from 2008 to 2013 to gain a better understanding of how to proceed.
Marketing campaigns depend on the selling strategy just as much as on the product itself.
In this particular problem, telecommunication can be divided into two forms: inbound and
outbound communication. The distinction depends on who initiates the contact. For instance, if
a current customer calls in about a particular banking issue, the customer service operator
could treat that customer as a warm lead and try to sell them further banking services.
Outbound calls, on the other hand, will be further analyzed to find leads to new customers for
the bank. Building this model is expected to yield significant time and cost savings in call
center operations: it reduces the amount the bank pays the call center to make the calls and
narrows down the number of people who are contacted. If too many people are called, the
campaign may not be profitable. If the wrong people are contacted, it is likely to be
unprofitable as well. However, if these models allow fewer, more effective calls to be made to
the customers with the highest propensity to buy banking products, call center operators will
become much more successful in selling products and subscriptions.
Data mining and data warehousing tools will be used to select the clients who are most likely
to subscribe to certain products. With this data, three classification models can be compared:
logistic regression, decision trees, and neural networks. A number of metrics can then be used
to show the benefits of the models. These metrics include a lift curve, a confusion matrix, and
a ROC curve, as well as a number of graphs.
Model Building and Testing
Measure Metrics: Accuracy Rate, AUC, R square
Metrics                   Benchmark Model  Logistic Regression  Decision Tree  Neural Network
AUC                       0.93087          0.7198               0.7637         0.8421
Accuracy Rate (Training)  88.60%           87.15%               85.85%         86.05%
Accuracy Rate (Testing)   87.7%            87.25%               85.25%         85.35%
R Square                  0.3649           0.1687               0.155          0.3423
Profit Analysis
To further evaluate our models, we made some assumptions about the profit and cost of the
marketing campaign.
Cost: We assume that most of the variable cost of the marketing campaign is labor cost and
other labor-related cost. Because the durations of successful and failed calls differ
significantly, we assign them different costs. In the full data set, the average successful
call lasts about 9.22 minutes and the average failed call about 3.68 minutes. The average wage
in Portugal was about 1,100 euros per month in 2010, which is about 7 euros per hour. To
simplify the calculation, we assume that a successful call costs 3 euros and a failed call
costs 1.5 euros.
Profit: We assume that each successful call brings in a deposit of 1,000 euros on average.
After considering the interest rate and the bank loan rate, we make the simple assumption that
each successful call earns the bank 9 euros on average. We calculate the profit for each model
based on the sample data.
Measure      Benchmark  Decision Tree  Logistic Regression  Neural Network  Combined
Cutoff Rate  17%        16%            15%                  18%             18%
Training     € 440.50   € 337.00       € 371.50             € 481.00        € 440.00
Testing      € 433.00   € 214.00       € 292.00             € 271.00        € 311.50
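The profit assumptions above can be turned into a small calculation. The following is a minimal sketch, not the team's actual JMP workflow: it assumes that the 9 euros is a gross gain from which the call cost is subtracted, and that a client is called whenever the predicted probability reaches the cutoff. The function name and the example scores are hypothetical, and the sketch does not claim to reproduce the figures in the table.

```python
# Sketch of the profit calculation under the report's assumptions.
# ASSUMPTION: the 9 euro gain is gross, so the call cost is subtracted from it.
GAIN_PER_SUCCESS = 9.0   # euros earned per successful call (from the report)
COST_SUCCESS_CALL = 3.0  # euros of labor cost per successful call (from the report)
COST_FAILED_CALL = 1.5   # euros of labor cost per failed call (from the report)


def campaign_profit(probabilities, outcomes, cutoff):
    """Profit from calling every client whose predicted probability >= cutoff."""
    profit = 0.0
    for probability, bought in zip(probabilities, outcomes):
        if probability < cutoff:
            continue  # client is not called: no cost and no gain
        if bought:
            profit += GAIN_PER_SUCCESS - COST_SUCCESS_CALL
        else:
            profit -= COST_FAILED_CALL
    return profit


# Hypothetical scores and outcomes, for illustration only.
example_probabilities = [0.40, 0.22, 0.17, 0.10, 0.05]
example_outcomes = [1, 0, 1, 0, 1]
example_profit = campaign_profit(example_probabilities, example_outcomes, cutoff=0.15)
```

Evaluating the same function over a range of cutoffs and keeping the most profitable one is one way a cutoff rate per model could be chosen.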
Benchmark Model - no external variables, with duration
Logistic Regression Model
First, we build a benchmark model for our analysis using logistic regression. We use all the
internal variables from the data set, including duration. In a realistic prediction model, the
duration of the call is not available before the call is made, yet it significantly affects
the prediction: the longer a worker spends on the phone, the higher the probability that the
recipient will buy the deposit. Our results confirm that duration is the most significant
variable in the logistic regression model. The other variables taken into account are pdays,
month, and contact. Our further goal is to find models that can beat the benchmark model
without using the duration variable.
Forecast Model: without duration, include external variables
Logistic Regression Model
Method: First we used stepwise selection to find variables that correlate with the outcome Y.
We found two variables, pdays and nr.employed. We then fit a nominal logistic model. Our model
has a p-value < 0.0001, RSquare = 0.1687, and AUC = 0.7198.
Looking further at the variables, we found that most pdays values are “999”, which means that
most of the clients called had not been contacted previously. Of the 4,000 clients in the data
set, 445 bought the long-term deposit. Of those 445 buyers, 99 (about 22%) had been contacted
before. The implication is that around 22% of the successful sales came from clients we had
contacted previously, even though such clients make up only a small part of the data.
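The 22% figure is the share of buyers who had been contacted before (99 of 445). The conversion rate among previously contacted clients is a different ratio and needs the total number of such clients, which the report does not state. The sketch below computes both; the helper name is ours, and the count of 200 previously contacted clients is purely hypothetical.

```python
def share(part, whole):
    """Return part / whole as a fraction."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


buyers = 445                      # from the report
buyers_previously_contacted = 99  # from the report

# Share of buyers who had been contacted in a previous campaign.
share_of_buyers = share(buyers_previously_contacted, buyers)

# HYPOTHETICAL count of previously contacted clients (not given in the report),
# used only to show that the conversion rate is a different quantity.
hypothetical_previously_contacted = 200
conversion_rate = share(buyers_previously_contacted, hypothetical_previously_contacted)
```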
We then looked at the profiler and noticed that nr.employed has a negative relationship with
success: when the number of employed people in Portugal decreases, clients are more likely to
buy long-term deposits. A plausible explanation is that people feel more insecure when the
unemployment rate is high, so they are more willing to put money in the bank.
Decision Tree Model
When building the decision tree model, we tried not to overfit, and we tried to make the splits
reflect business sense. The first split is on euribor3m. If euribor3m < 1.266, there is a 37%
chance of buying the long-term bank deposit; this branch covers 284 of the 2,000 people. If
euribor3m > 1.266, we split further on contact communication type: there is a 7% chance of
buying when the client is contacted by cell phone, compared with only 4% by telephone. The
third split, under the cell phone group, is on job type. For the job types retired, unemployed,
student, admin., blue-collar, and entrepreneur, there is a 9% chance of buying, whereas the
other groups have only a 2% chance. The final split, under the group containing retired,
unemployed, etc., is on age at 50: for clients older than 50, there is a 13% chance of buying.
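The splits described above can be written as nested rules. This is a sketch of the tree as the text describes it, not an export from JMP. The function name is hypothetical, the handling of euribor3m exactly equal to 1.266 is an assumption, and the leaf for clients aged 50 or younger returns None because the report does not give its rate.

```python
from typing import Optional

# Job types that the report groups together in the third split.
HIGHER_RATE_JOBS = {
    "retired", "unemployed", "student", "admin.", "blue-collar", "entrepreneur",
}


def tree_purchase_rate(euribor3m: float, contact: str, job: str, age: int) -> Optional[float]:
    """Purchase rate reported for the leaf a client falls into, or None if unreported."""
    if euribor3m < 1.266:
        return 0.37
    # ASSUMPTION: euribor3m == 1.266 follows the "greater than" branch.
    if contact == "telephone":
        return 0.04
    # Remaining clients were contacted by cell phone ("cellular" in the data).
    if job not in HIGHER_RATE_JOBS:
        return 0.02
    if age > 50:
        return 0.13
    # The higher-rate job group as a whole buys at 9%, but the rate for
    # clients aged 50 or younger is not reported.
    return None
```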
Neural Network Model
Training Data
As shown in the Appendix (under Neural Network A), the generalized RSquare with the variables
we used is 0.3422. Although the RSquare would ideally be closer to 1, the team believes this is
a strong value considering the number of variables used and the setting in which the model is
applied. The data set contains few quantitative variables, which typically makes it harder to
find a correlation. It carries a lot of demographic data that, although just as valuable, is
often harder to use in a neural network. We also found that, in the testing data, the model's
ability to separate the yeses from the nos was fairly accurate. Out of the 2,000 members of the
testing data, 68 were presented as success, which equated to a success rate of 32%.
The lift curve shows that if the 10% of the target audience ranked highest by the model is
contacted, the model achieves a lift of 5 over the standard procedure of using no model. In
other words, we reach five times as many successful customers as we would by contacting 10% of
the audience without a model, three times as many if we reach out to 20% of the audience, etc.
In the graph, the red line shows the success rate when no model is used; the blue line shows
the lift when the model is used. The predictive model therefore lets us contact our audience in
the order that gives the highest rate of success.
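Lift at a given contact depth can be computed directly from model scores and outcomes. The sketch below is a plain-Python illustration, not the JMP lift curve: it ranks clients by score, takes the top fraction, and divides their success rate by the overall success rate. The function name and the ten example clients are hypothetical.

```python
def lift_at(scores, outcomes, fraction):
    """Lift among the top `fraction` of clients when ranked by score."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    base_rate = sum(outcomes) / len(outcomes)
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no successes")
    # Rank clients from the highest score to the lowest.
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    top_rate = sum(outcome for _, outcome in ranked[:n_top]) / n_top
    return top_rate / base_rate


# Hypothetical data: 10 clients, 2 buyers, both ranked at the top by the model.
example_scores = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
example_outcomes = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
lift_top_10 = lift_at(example_scores, example_outcomes, 0.10)
lift_top_50 = lift_at(example_scores, example_outcomes, 0.50)
```

In this toy example the lift falls as more of the audience is contacted, which is the same pattern the report describes between the 10% and 20% depths.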
The ROC curve is constructed by computing the sensitivity and specificity as the share of the
audience classified as a success increases. The area under the curve measures the model's
ability to correctly separate those who will sign up with the Portuguese bank from those who
will not. The receiver operating characteristic curve shows two important quantities: the rate
at which the model identifies true positives and the rate at which it produces false positives.
Knowing that an area of 1 is perfect and 0.5 is useless, the area under the curve of 0.8421 for
our model is considered a good result for the neural network.
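The area under the ROC curve equals the probability that a randomly chosen buyer receives a higher score than a randomly chosen non-buyer. The sketch below uses that pairwise definition rather than tracing the curve; it is an illustration in plain Python with hypothetical scores, not the computation JMP performs.

```python
def roc_auc(scores, outcomes):
    """AUC as the share of (buyer, non-buyer) pairs that the model ranks correctly."""
    positives = [s for s, y in zip(scores, outcomes) if y == 1]
    negatives = [s for s, y in zip(scores, outcomes) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one buyer and one non-buyer")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # buyer ranked above non-buyer
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical scores: one buyer is ranked below one non-buyer.
example_auc = roc_auc([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```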
Testing Data
The testing data give results very similar to those of the training data, which tells us that
this is a very strong model. The model identified 133 true positives in the testing data set.
Although no dollar amount is attached to the successes or to the clients the telemarketers
were unable to enroll, the model shows an accuracy of 85.35%, that is, the share of testing
records it classifies correctly. The metrics show a true positive rate of 56.84% and a false
positive rate of 10.87%, which the team considers satisfactory and which shows that this is a
strong model.
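The three rates quoted for the testing data all follow from a confusion matrix. In the sketch below, only the 133 true positives and the 2,000 testing records come from the report; the other three counts are back-calculated by us so that they match the reported rates, and should be read as an inference rather than as figures taken from the team's output.

```python
def classification_rates(tp, fn, fp, tn):
    """Return (accuracy, true positive rate, false positive rate)."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    true_positive_rate = tp / (tp + fn)   # share of buyers the model catches
    false_positive_rate = fp / (fp + tn)  # share of non-buyers flagged as buyers
    return accuracy, true_positive_rate, false_positive_rate


# tp is from the report; fn, fp and tn are INFERRED from the reported rates.
tp, fn, fp, tn = 133, 101, 192, 1574
accuracy, tpr, fpr = classification_rates(tp, fn, fp, tn)
```

With these counts the accuracy is 1,707 / 2,000 = 85.35%, the true positive rate is 133 / 234, or about 56.84%, and the false positive rate is 192 / 1,766, or about 10.87%.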
Finalized Model and Business Analysis
To further improve our prediction, we combine all three models using a regression method. Each
of the three models gives us the probability of success (1). Based on that probability, we
predict the result using a cutoff value of 15%. This gives the predicted results of all the
models, as shown below.
Y  Neural Network  Decision Tree  Logistic Regression
0  0               0              0
0  0               0              0
0  0               0              0
1  1               0              0
0  0               0              0
0  1               1              1
0  1               1              1
0  0               0              0
Result of Prediction
With all the prediction results in hand, we want to use the information from all three
prediction methods. We therefore ran a regression analysis to obtain a weighted average of the
models: the regression coefficient of each model serves as its weight in calculating the
combined probability of success. The result shows that the neural network carries the largest
weight in the final model.
Model Parameter
NeuralNetwork 0.67
Decision Tree 0.21
Logistic Regression 0.11
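One way to read the parameters above is as weights in a linear combination of the three model probabilities. The sketch below makes that reading explicit. It assumes there is no intercept term (the report does not mention one) and that a client is predicted as a success when the combined probability reaches the cutoff; the function names and the example probabilities are hypothetical.

```python
# Weights taken from the parameter table in the report.
WEIGHTS = {
    "neural_network": 0.67,
    "decision_tree": 0.21,
    "logistic_regression": 0.11,
}


def combined_probability(p_neural_network, p_decision_tree, p_logistic_regression):
    """Weighted combination of the three model probabilities (ASSUMPTION: no intercept)."""
    return (
        WEIGHTS["neural_network"] * p_neural_network
        + WEIGHTS["decision_tree"] * p_decision_tree
        + WEIGHTS["logistic_regression"] * p_logistic_regression
    )


def predict(probability, cutoff=0.15):
    """Return 1 (call the client) when the probability reaches the cutoff, else 0."""
    return 1 if probability >= cutoff else 0


# Hypothetical probabilities from the three models for a single client.
example_combined = combined_probability(0.5, 0.2, 0.1)
example_prediction = predict(example_combined)
```

The default cutoff of 15% is the value quoted in the text; the profit table lists 18% for the combined model, so the cutoff is left as a parameter.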
It is worth mentioning that the team that previously worked on this data set analyzed 52,944
records, which were collected from the Portuguese bank from 2008 to 2013. That team started
with an initial set of 150 inputs that are commonly used for predictive analytics in the
banking industry. They used logistic regression, decision tree, neural network, and support
vector machines (which we did not use for this model), and narrowed the variables down to 22
relevant features. For their models, they compared two critical metrics: 1.) AUC, for which
their result was 0.80; 2.) Lift, which revealed that 79% of the successful sales could be
achieved by contacting only half of the clients.
With our best model, the neural network, we achieved an AUC of 0.84 and a lift where about 85%
of the successful sales could be found with the model. Although we were able to beat the model
of our predecessors, a couple of factors in our analysis may have enabled us to do so: 1.)
Instead of 52,944 records, our team analyzed 4,000 records, using 2,000 as training data and
2,000 as testing data. 2.) The period of the initial analysis (2008-2013) saw a severe global
economic contraction in nearly all modern economies, which could have lowered our predecessors'
numbers for lift, accuracy, and true positives.
Conclusion and Implication
In the telemarketing industry, optimizing the targeted audience is a key driver of sales
success. More specifically, the banking industry has been under increasing pressure to increase
profits and become more efficient since the 2008 financial crisis. Because of the crisis,
Portuguese banks were further pressured to increase their capital reserves, which is largely
why these data-driven models are such an important tool for capturing this specific audience.
The more CDs and savings accounts the Portuguese bank can open by selecting a specific
audience, rather than bearing the cost of targeting a blanket group, the greater the bank's
reserves, the more money the bank will be allowed to lend to customers, and the greater its
ability to increase profits conservatively.
Appendix
Data Preview
Variables
1 - Age (numeric)
2 - Job : type of job "admin.","blue-collar","entrepreneur","housemaid","management","retired",
"self-employed","services","student","technician","unemployed","unknown"
3 - Marital : marital status (categorical: "divorced","married","single","unknown"; note:
"divorced" means divorced or widowed)
4 - Education (categorical: "basic.4y","basic.6y","basic.9y","high.school","illiterate",
"professional.course","university.degree","unknown")
5 - Default: has credit in default? (categorical: "no","yes","unknown")
6 - Housing: has housing loan? (categorical: "no","yes","unknown")
7 - Loan: has personal loan? (categorical: "no","yes","unknown")
# related with the last contact of the current campaign:
8 - Contact: contact communication type (categorical: "cellular","telephone")
9 - Month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
10 - Day_of_week: last contact day of the week (categorical: "mon","tue","wed","thu","fri")
11 - Duration: last contact duration, in seconds (numeric). Important note: this attribute
highly affects the output target (e.g., if duration=0 then y="no"). Yet, the duration is not known
before a call is performed. Also, after the end of the call y is obviously known. Thus, this
input should only be included for benchmark purposes and should be discarded if the
intention is to have a realistic predictive model.
# Other attributes:
12 - Campaign: number of contacts performed during this campaign and for this client
(numeric, includes last contact)
13 - Pdays: number of days that passed by after the client was last contacted from a
previous campaign (numeric; 999 means client was not previously contacted)
14 - Previous: number of contacts performed before this campaign and for this client
(numeric)
15 - Poutcome: outcome of the previous marketing campaign (categorical:
"failure","nonexistent","success")
External Variables
# Social and economic context attributes
16 - Emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - Cons.price.idx: consumer price index - monthly indicator (numeric)
18 - Cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - Euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - Nr.employed: number of employees - quarterly indicator (numeric)
Distribution of variables
Neural Network
Here is the contingency table for the training data set for the Neural Network
  • 1. Portuguese Bank Marketing Campaign By Eric Esajian, Logan Liang, Shuo Wang, Qian Zhang, Stephanus Gunawan Executive Summary The objective of this project is to help a Portuguese retail bank increase the success of the telemarketing effort to sell long-term bank deposits. The Portuguese bank needs to increase its reserve to satisfy the requirement of the regulator and increase its revenue, and this tele- marketing effort will help Portuguese bank to reach its objective. Our group use data mining techniques, including Decision Tree, Logistic Regression and Neural Network, to help improve profit from the marketing campaign. The data set we have is from the previous telemarketing campaign that has already been conducted, including from customer information to previous call information. There is also some external social and economic context attributes in the data, which could help us further improve the model building. Profit and cost information cannot be obtained from the dataset we are using. So we make some assumptions of cost and profit to calculate the total profit gained by using our model. After cleaning the data and making it usable in JMP, the first step we did was to create a benchmark model. The benchmark model is logistic model with only internal variables from the previous marketing campaign (no external variables). Because the benchmark model takes duration variable into account, so it cannot be used as realistic prediction model. But it gives us a benchmark to compare our later forecast models with. The techniques we use for the forecast models are Decision Tree, Logistic Regression and Neural Network. For each different technique, we make a forecast model. We use some statistical parameters to be the measure metrics, as well as the profit calculated based on our previous assumption. We acquired some insight from those models, which will be deeply interpreted in our report. 
After coming up with three models, we combine those three models together by applying Regression Model. We’ll explain what do in that part in our report. The measure metrics and total profit gained show that our best model does in fact give better results. Background A Portuguese retail bank is looking to find a way to predict the successes of telemarketing calls to sell long-term bank deposits, ie CD’s, savings accounts, etc. In hopes of predicting these successes, the Portuguese retail bank collected historical data from 2008 to 2013 in hopes of
  • 2. gaining a stronger grasp of proceeding with this project. Marketing campaigns are highly dependent upon the selling strategy just as much as it is with the product itself. In this particular problem, telecommunication can be divided into two forms: inbound and outbound communication. This is dependent on which the call center will be contacting. For instance, if a current customer is calling in regards to a particular banking issue they may have, the customer service operator could look at that customer as a warm lead to further sell them banking services and/or processes. On the other hand, outbound calls will we further analyzed to find leads to new customers for the bank. As a consequence of building this model, the analysis will show significant time and cost savings in regards to the call center operations. This includes the amount of money that the bank will pay the call center to make the calls, as well as narrowing down the amount of persons whom will be contacted. If too many people are called, this campaign may not be profitable. If the wrong persons are contacted, this may likely prove to be unprofitable as well. However, if fewer, more effective calls are made to customers with the highest propensity to buy banking products by creating these functional models, call center operators will become much more successful in selling products and subscriptions. Data mining and data warehousing tools will be used to select the most likely clients who will likely subscribe to certain products. With this data, three classification models can be compared to further show business intelligence: logistic regression, decision trees, and neural networks. After using these models, a number of metrics can be used to help further show the benefits of using information technology. These metrics include a lift curve, confusion matrix, ROC curve, as well as a number of visual graphs.
  • 3.
  • 4. Model Building and Testing Measure Metrics: Accuracy Rate, AUC, R square Metrics BenchmarkModel Logistic Regression Decision Tree NeuralNetwork AUC 0.93087 0.7198 0.7637 0.8421 Accuracy Rate(Training) 88.60% 87.15% 85.85% 86.05% Accuracy Rate(Testing) 87.7% 87.25% 85.25% 85.35% R Square 0.3649 0.1687 0.155 0.3423 Profit Analysis To further evaluate our model, we made some assumptions on the profit and cost of the marketing campaign. Cost: We assume that most of the non-fixed cost of the marketing campaign is the labor cost and some other labor-related cost. Due to the duration of the success and fail call is significantly different, we assign them different money value of cost. From the full size data set, we found that the average success call last about 9.22 min and fail call about 3.68 min. The average wage of Portugal is about 1,100 Euro per month in 2010, which about 7 euro per hour. To simplify the calculation, we assume the cost of a success call is 3 euro, a fail call is 1.5 euro. For profit of each call, we assume each success call will save 1,000 euro deposit in average. After considering the interest rate and bank loan rate, we make a simple assumption that each successful call will make the bank 9 Euros on average. We calculate the profit for each models based on the sample data. Column1 Benchmark Decision Tree Logistic Regression Neural Network Combined Cutoff Rate 17% 16% 15% 18% 18% Training € 440.50 € 337.00 € 371.50 € 481.00 € 440.00 Testing € 433.00 € 214.00 € 292. 00 € 271.00 € 311.50
  • 5. Benchmark Model - no external variables, with duration. Logistic Regression Model: First, we build a benchmark model for our analysis using logistic regression. We use all the internal variables from the data set, including the duration. In a realistic prediction model, the duration of the call is not available before the call is made, yet it significantly affects the prediction: the longer a worker spends on the phone, the higher the probability that the receiver will buy the deposit. Our results confirm that duration is the most significant variable in the logistic regression model. The other variables taken into account are pdays, month, and contact. Our further goal is to find models that can beat the benchmark model without using the duration variable.
  • 6. Forecast Model: without duration, include external variables. Logistic Regression Model. Method: First we used stepwise selection to find variables that correlate with the outcome Y. We found two variables, pdays and nr.employed. We then fit a nominal logistic model. Our model has a p-value < 0.0001, RSquare = 0.1687, and AUC = 0.7198. Taking a closer look at the variables, we found that for pdays most of the values are "999", which means that most calls went to clients who had not been contacted before. Among the 4000 people in the data set, 445 bought the long-term deposit in the end. Of those 445 buyers, 99 (about 22%) were clients who had been contacted before. Since previously contacted clients make up only a small part of the data set, they are overrepresented among the buyers, which implies that a client we have contacted before is more likely to buy our service. Looking at the profiler, we noticed that nr.employed has a negative relation with success: if the number of employed people in Portugal decreases, people are more likely to buy long-term deposits. People feel more insecure when the unemployment rate is high, so they are more willing to put money in the bank.
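The arithmetic behind the 22% figure can be checked with a short sketch. It separates two quantities that are easy to confuse: the share of buyers who were contacted before (which the slide's counts give directly) and the probability that a previously contacted client buys (which also needs the number of previously contacted clients, not reported on this slide). The function name `conversion_rate` and the group size of 150 are hypothetical.

```python
# Counts reported on the slide.
buyers = 445                        # clients who bought the deposit
buyers_previously_contacted = 99    # buyers with pdays != 999

# Share of buyers who had been contacted before: 99 / 445, about 22%.
share_of_buyers = buyers_previously_contacted / buyers


def conversion_rate(buyers_in_group: int, group_size: int) -> float:
    """P(buy | client is in the group) = buyers in the group / size of the group."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return buyers_in_group / group_size


# Hypothetical group size, for illustration only: if 150 clients in the
# sample had been contacted before, their conversion rate would be 99 / 150.
hypothetical_rate = conversion_rate(buyers_previously_contacted, 150)
print(round(share_of_buyers, 4), round(hypothetical_rate, 2))
```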
  • 8. Decision Tree Model: When building the decision tree model, we tried not to overfit the model and to make the splits reflect business sense. The first split used euribor3m. If euribor3m < 1.266, there was a 37% chance of buying the long-term bank deposit; this group accounted for 284 of the 2000 people. If euribor3m > 1.266, we split further by contact communication type. We found a 7% chance of buying when the client was contacted by cell phone, compared to only 4% by telephone. The third split, under the cell phone group, was based on job type. For the job types retired, unemployed, student, admin., blue-collar, and entrepreneur, there was a 9% chance of buying, whereas the other job types had only a 2% chance. The final split was under the group containing retired, unemployed, etc., and split at an age of 50. If the age was greater than 50, there was a 13% chance of buying.
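The splits described on this slide can be written out as nested rules. This is a sketch of the tree as described in the text, not the JMP output. Two points are assumptions: the slide does not say which branch a value of exactly 1.266 takes, and it does not report the rate for the age-50-or-under leaf, so the sketch falls back to the 9% rate of the parent node there. The function name `tree_success_rate` is hypothetical.

```python
# Job types that the slide groups together under the cell phone branch.
HIGHER_RATE_JOBS = {
    "retired", "unemployed", "student", "admin.", "blue-collar", "entrepreneur",
}


def tree_success_rate(euribor3m: float, contact: str, job: str, age: int) -> float:
    """Return the success rate of the tree node a client falls into."""
    if euribor3m < 1.266:
        return 0.37                # first split: low 3-month Euribor rate
    # Assumption: euribor3m == 1.266 is treated like the "> 1.266" branch.
    if contact != "cellular":
        return 0.04                # contacted by telephone
    if job not in HIGHER_RATE_JOBS:
        return 0.02                # cell phone, other job types
    if age > 50:
        return 0.13                # cell phone, listed job types, older than 50
    # Assumption: leaf rate for age <= 50 is not reported; use the parent's 9%.
    return 0.09


print(tree_success_rate(4.0, "cellular", "retired", 62))
```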
  • 11. Neural Network Model. Training Data: As shown in the Appendix (under Neural Network A), the generalized RSquare with the variables we used is 0.3422. Although the RSquare would ideally be closer to 1, the team believes this is a strong value considering the number of variables used and the setting in which the model is applied. The data set contains few quantitative variables, which typically makes it harder to find a correlation. It carries a lot of demographic data that, although just as valuable, is often harder to use in a neural network. We also found that, in the testing data, the model separated the yes's and the no's fairly accurately. Out of the 2000 members of the testing data, 68 were presented as success, which equated to a success rate of 32%.
  • 12. Viewing the lift curve, the cumulative gains of the team's model show that if the 10% of the targeted audience ranked highest by the model is contacted, the model gives a lift of 5 over the standard procedure of using no model. In other words, contacting the top 10% of our audience reaches five times as many successful customers as contacting a random 10%, contacting the top 20% reaches three times as many, and so on. In the graph, the red line is the success rate when no model is used; the blue line is the lift when the model is used. The value of this predictive model is that it lets us contact our audience in ranked order, which gives us the highest rate of success.
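The lift figures quoted on this slide follow from a simple ratio, sketched below: rank the clients by predicted probability, contact the top fraction, and compare the share of all buyers captured with the fraction contacted. The function name `cumulative_lift` and the ten-client example are hypothetical and are not taken from the project data.

```python
def cumulative_lift(scores, labels, fraction):
    """Lift when contacting the top `fraction` of clients ranked by score.

    scores: predicted probabilities of success, one per client
    labels: 1 if the client actually bought, 0 otherwise
    """
    total_buyers = sum(labels)
    if total_buyers == 0:
        raise ValueError("lift is undefined when there are no buyers")
    # Rank clients from the highest to the lowest predicted probability.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_contacted = max(1, round(fraction * len(ranked)))
    buyers_reached = sum(label for _, label in ranked[:n_contacted])
    gain = buyers_reached / total_buyers        # share of all buyers captured
    return gain / (n_contacted / len(ranked))   # relative to random contact


# Hypothetical example: ten clients, both buyers ranked at the top.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(cumulative_lift(scores, labels, 0.2))
```

A lift of 1 means the model does no better than contacting clients at random, which is the red baseline described on the slide.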
  • 13. The ROC curve was constructed by computing the sensitivity and specificity at each classification threshold. The area under the curve measures the model's ability to classify correctly those who will sign up with the Portuguese bank and those who will not. The receiver operating characteristic curve shows two important quantities: the rate at which the model identifies true positives and the rate at which it produces false positives. Given that an area of 1 is perfect and 0.5 is useless, our model's area under the curve of 0.8421 indicates a good test for the neural network. Testing Data: The testing data show results very similar to those of the training data, which tells us that this is a strong model. The model identified 133 true positives in the testing data set. Although no dollar amount is given for the successes or for those the telemarketers were unable to enroll, the model shows an accuracy of 85.35%, meaning that it correctly classifies 85.35% of the clients as buyers or non-buyers. In the metrics below, the model shows a true positive rate of 56.84% and a false positive rate of 10.87%, which we consider satisfactory.
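The three testing metrics on this slide all come from one confusion matrix, as the sketch below shows. Only the 133 true positives are reported in the text. The other three counts are back-calculated here from the reported rates on 2000 testing records, so they are an assumption rather than figures from the report. The function name `classification_rates` is hypothetical.

```python
def classification_rates(tp: int, fp: int, fn: int, tn: int):
    """Return (true positive rate, false positive rate, accuracy)."""
    tpr = tp / (tp + fn)                     # share of actual buyers found
    fpr = fp / (fp + tn)                     # share of non-buyers flagged as buyers
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return tpr, fpr, accuracy


# tp is reported on the slide; fp, fn and tn are reconstructed (assumption)
# so that the rates match the reported 56.84%, 10.87% and 85.35%.
tpr, fpr, accuracy = classification_rates(tp=133, fp=192, fn=101, tn=1574)
print(f"TPR {tpr:.2%}, FPR {fpr:.2%}, accuracy {accuracy:.2%}")
```

Note that accuracy counts both correctly identified buyers and correctly identified non-buyers, which is why it can be high even when fewer than 60% of the buyers are found.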
  • 14. Finalized Model and Business Analysis
    To further improve our prediction, we combine all three models using a regression method. Each of the three models gives us the probability of success (1). From that probability, we predict the result using a cutoff value of 15%, which gives us the prediction results of all the models, as shown below.

    Y   NeuralNetwork   Decision Tree   Logistic Regression
    0   0               0               0
    0   0               0               0
    0   0               0               0
    1   1               0               0
    0   0               0               0
    0   1               1               1
    0   1               1               1
    0   0               0               0
    Result of Prediction

    With all the prediction results in hand, we want to use the information from all three prediction methods. We therefore ran a regression analysis to obtain a weight for each model. We use the regression coefficients as the weights of the individual models and calculate a combined probability of success. The result shows that the neural network has the largest share in the final model.
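The combination step can be sketched as a weighted sum of the three predicted probabilities, using the weights reported in the Model Parameter table (0.67, 0.21, 0.11). The sketch assumes the regression has no intercept, since none is reported, and uses the 18% cutoff listed for the combined model in the profit table. The function names and the example probabilities are hypothetical.

```python
# Weights from the Model Parameter table of the report.
WEIGHTS = {
    "neural_network": 0.67,
    "decision_tree": 0.21,
    "logistic_regression": 0.11,
}


def combined_probability(p_nn: float, p_dt: float, p_lr: float) -> float:
    """Weighted combination of the three model probabilities.

    Assumption: no intercept term, because the report lists only the weights.
    """
    return (WEIGHTS["neural_network"] * p_nn
            + WEIGHTS["decision_tree"] * p_dt
            + WEIGHTS["logistic_regression"] * p_lr)


def predict(probability: float, cutoff: float = 0.18) -> int:
    """Classify as success (1) when the probability reaches the cutoff."""
    return 1 if probability >= cutoff else 0


# Hypothetical client: the three models disagree on the probability.
p = combined_probability(p_nn=0.30, p_dt=0.09, p_lr=0.12)
print(round(p, 4), predict(p))
```

Because the neural network carries about two thirds of the weight, the combined prediction mostly follows it and the other two models act as a correction.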
  • 15. Model                 Parameter
    NeuralNetwork         0.67
    Decision Tree         0.21
    Logistic Regression   0.11

    It is worth mentioning that the team that previously worked on this data set analyzed 52,944 records, which were collected from the Portuguese bank from 2008 to 2013. When the previous team began analyzing the data, there was an initial set of 150 inputs that are commonly used for predictive analytics in the banking industry. They used logistic regression, decision tree, neural network, and support vector machines (which we did not use for this model). The previous team was able to narrow the variables down to 22 relevant features. For their models, they compared two critical metrics: 1.) AUC - their result was 0.80 2.) Lift - which revealed that 79% of the successful sales could be achieved when contacting only half of the clients given. With our best model, the neural network, we achieved an AUC of 0.84 and a lift where about 85% of the successful sales could be found with the model created. Although we were able to beat the model of our predecessors, a couple of factors in our analysis may have enabled us to do so: 1.) Instead of using 52,944 records, our team analyzed 4,000 records, using 2000 for training data and 2000 for testing data. 2.) During the time of the initial analysis (2008-2013), there was a severe global economic contraction in nearly all the modern economies, which could have caused the predecessors to have lower numbers for lift, accuracy, and the number of true positives.

    Conclusion and Implication
    In the telemarketing industry, optimizing the targeted audience is a key driver of sales success. More specifically, the banking industry has been under increasing pressure to increase profits and become more efficient since the 2008 financial crisis.
Because of the financial crisis, Portuguese banks came under further pressure to increase their reserves to meet capital requirements, which is largely why these data-driven models are such an important tool for capturing this specific audience. The more bank accounts, CDs, and savings accounts the Portuguese bank can open by selecting a specific audience rather than paying to target a blanket group, the greater the bank's reserves, the more money the bank will be allowed to loan to customers, and the greater its ability to increase profits conservatively.
  • 16. Appendix
    Data Preview
    Variables
    1 - Age (numeric)
    2 - Job: type of job (categorical: "admin.","blue-collar","entrepreneur","housemaid","management","retired","self-employed","services","student","technician","unemployed","unknown")
    3 - Marital: marital status (categorical: "divorced","married","single","unknown"; note: "divorced" means divorced or widowed)
    4 - Education (categorical: "basic.4y","basic.6y","basic.9y","high.school","illiterate","professional.course","university.degree","unknown")
    5 - Default: has credit in default? (categorical: "no","yes","unknown")
    6 - Housing: has housing loan? (categorical: "no","yes","unknown")
    7 - Loan: has personal loan? (categorical: "no","yes","unknown")
    # related with the last contact of the current campaign:
    8 - Contact: contact communication type (categorical: "cellular","telephone")
    9 - Month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
  • 17. 10 - Day_of_week: last contact day of the week (categorical: "mon","tue","wed","thu","fri")
    11 - Duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y="no"). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
    # Other attributes:
    12 - Campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
    13 - Pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
    14 - Previous: number of contacts performed before this campaign and for this client (numeric)
    15 - Poutcome: outcome of the previous marketing campaign (categorical: "failure","nonexistent","success")
    External Variables
    # Social and economic context attributes
    16 - Emp.var.rate: employment variation rate - quarterly indicator (numeric)
    17 - Cons.price.idx: consumer price index - monthly indicator (numeric)
    18 - Cons.conf.idx: consumer confidence index - monthly indicator (numeric)
    19 - Euribor3m: euribor 3 month rate - daily indicator (numeric)
    20 - Nr.employed: number of employees - quarterly indicator (numeric)
  • 20. Neural Network Here is the contingency table for the training data set for the Neural Network