Portuguese Bank Marketing Campaign
By
Eric Esajian, Logan Liang, Shuo Wang, Qian Zhang, Stephanus Gunawan
Executive Summary
The objective of this project is to help a Portuguese retail bank increase the success of its
telemarketing effort to sell long-term bank deposits. The bank needs to increase its reserves to
satisfy regulatory requirements and to increase its revenue, and this telemarketing effort will
help it reach that objective. Our group uses data mining techniques, including Decision Tree,
Logistic Regression and Neural Network, to help improve the profit from the marketing campaign.
The data set comes from a previous telemarketing campaign and ranges from customer information
to information about previous calls. It also contains some external social and economic context
attributes, which could help us further improve the models. Profit and cost information cannot
be obtained from the data set, so we make some assumptions about cost and profit to calculate
the total profit gained by using our model.
After cleaning the data and making it usable in JMP, our first step was to create a benchmark
model. The benchmark model is a logistic regression model that uses only the internal variables
from the previous marketing campaign (no external variables). Because the benchmark model takes
the duration variable into account, it cannot be used as a realistic prediction model, but it
gives us a benchmark against which to compare our later forecast models.
The techniques we use for the forecast models are Decision Tree, Logistic Regression and
Neural Network, and we build one forecast model with each technique. We evaluate the models
with statistical metrics as well as with the profit calculated from our cost and profit
assumptions. We gained some insights from those models, which are interpreted in detail in the
report. After building the three models, we combine them by applying a regression model, as
explained in the Finalized Model section of the report. The metrics and the total profit gained
show that our best model does in fact give better results.
Background
A Portuguese retail bank is looking for a way to predict the success of telemarketing calls
to sell long-term bank deposits, i.e. CDs, savings accounts, etc. To this end, the bank
collected historical data from 2008 to 2013 to gain a stronger grasp of how to proceed with
this project. Marketing campaigns depend on the selling strategy just as much as on the
product itself.
In this particular problem, telecommunication can be divided into two forms: inbound and
outbound communication, depending on who initiates the call. For instance, if a current
customer calls in about a particular banking issue, the customer service operator can treat
that customer as a warm lead and offer further banking services. Outbound calls, on the other
hand, will be further analyzed to find leads to new customers for the bank. As a consequence of
building this model, the analysis will show significant time and cost savings in call center
operations. These include the amount of money that the bank pays the call center to make the
calls, as well as a narrower list of people to be contacted. If too many people are called,
the campaign may not be profitable. If the wrong people are contacted, it is likely to be
unprofitable as well. However, if fewer, more effective calls are made to the customers with
the highest propensity to buy banking products, as identified by these models, call center
operators will become much more successful in selling products and subscriptions.
Data mining and data warehousing tools will be used to select the clients who are most likely
to subscribe to certain products. With this data, three classification models can be compared:
logistic regression, decision trees, and neural networks. A number of metrics are then used to
evaluate the models and to show the benefit of using them. These metrics include a lift curve,
confusion matrix, ROC curve, as well as a number of visual graphs.
Model Building and Testing
Evaluation Metrics: Accuracy Rate, AUC, R Square
Metrics                   Benchmark Model  Logistic Regression  Decision Tree  Neural Network
AUC                       0.93087          0.7198               0.7637         0.8421
Accuracy Rate (Training)  88.60%           87.15%               85.85%         86.05%
Accuracy Rate (Testing)   87.7%            87.25%               85.25%         85.35%
R Square                  0.3649           0.1687               0.155          0.3423
Profit Analysis
To further evaluate our models, we made some assumptions about the profit and cost of the
marketing campaign.
Cost: We assume that most of the non-fixed cost of the marketing campaign is labor cost and
other labor-related costs. Because the durations of successful and failed calls differ
significantly, we assign them different costs. In the full-size data set, the average
successful call lasts about 9.22 minutes and the average failed call about 3.68 minutes. The
average wage in Portugal was about 1,100 euros per month in 2010, which is about 7 euros per
hour. To simplify the calculation, we assume that a successful call costs 3 euros and a failed
call costs 1.5 euros.
Profit: We assume that each successful call brings in a deposit of 1,000 euros on average.
After considering the interest rate and the bank loan rate, we make the simple assumption that
each successful call earns the bank 9 euros on average. We calculate the profit for each model
based on the sample data.
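The cost and profit assumptions above can be sketched in a few lines of Python. This is only an illustration: the 160 working hours per month, the function names, and the call counts in the example are assumptions introduced here, and the report does not say whether the 9 euros per successful call is before or after the cost of the call (the sketch treats it as before).

```python
# Figures quoted in the text above.
MONTHLY_WAGE_EUR = 1100.0      # average monthly wage in Portugal, 2010
COST_SUCCESS_EUR = 3.0         # assumed cost of a successful call
COST_FAIL_EUR = 1.5            # assumed cost of a failed call
GAIN_PER_SUCCESS_EUR = 9.0     # assumed gain per successful call

# Assumption (not in the report): 160 working hours per month.
HOURS_PER_MONTH = 160.0


def hourly_wage(monthly_wage=MONTHLY_WAGE_EUR, hours=HOURS_PER_MONTH):
    """Convert a monthly wage into an hourly wage."""
    return monthly_wage / hours


def labor_cost(minutes, wage_per_hour):
    """Pure labor cost of a call that lasts `minutes` minutes."""
    return minutes / 60.0 * wage_per_hour


def campaign_profit(n_success, n_fail):
    """Campaign profit, reading the 9 euros as a gain before call costs.

    That reading is an assumption; the report does not state whether the
    9 euros is gross or net of the cost of the call.
    """
    gain = n_success * GAIN_PER_SUCCESS_EUR
    cost = n_success * COST_SUCCESS_EUR + n_fail * COST_FAIL_EUR
    return gain - cost


wage = hourly_wage()                      # 6.875 euros per hour
success_labor = labor_cost(9.22, wage)    # average successful call
fail_labor = labor_cost(3.68, wage)       # average failed call
example_profit = campaign_profit(n_success=100, n_fail=200)  # hypothetical counts
```

Under these assumptions the pure labor cost is roughly 1 euro for a successful call and 0.4 euros for a failed one, so the rounded figures of 3 and 1.5 euros leave room for the other labor-related costs.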
Metric       Benchmark  Decision Tree  Logistic Regression  Neural Network  Combined
Cutoff Rate  17%        16%            15%                  18%             18%
Training     € 440.50   € 337.00       € 371.50             € 481.00        € 440.00
Testing      € 433.00   € 214.00       € 292.00             € 271.00        € 311.50
Benchmark Model - no external variables, with duration
Logistic Regression Model
First, we build a benchmark model for our analysis using logistic regression. We use all the
internal variables from the data set, including the duration, to build the model. In a
realistic prediction setting, the duration of a call is not available before the call is made,
yet it significantly affects the prediction: the longer a worker spends on the phone, the
larger the probability that the receiver will buy the deposit. Our results confirm that
duration is the most significant variable in the logistic regression model. The other variables
taken into account are pdays, month, and contact. Our further goal is to find models that can
beat the benchmark model without using the duration variable.
Forecast Model: without duration, with external variables
Logistic Regression Model
Method: First we used stepwise selection to find variables that correlate with the outcome Y.
We found two variables, pdays and nr.employed. We then fitted a nominal logistic regression
model. Our model has a p-value < 0.0001, RSquare = 0.1687, and AUC = 0.7198.
Taking a closer look at the variables, we found that for pdays most of the values are “999”,
which means that most calls went to clients who had not been contacted previously. Among the
4000 records in the data set, 445 people bought the long-term deposit in the end. Of those 445
buyers, 99 were clients who had been contacted before. The implication is that about 22% of
the buyers (99 of 445) were clients we had contacted before, even though previously contacted
clients make up only a small part of the data set.
We then looked at the profiler and noticed that nr.employed has a negative relationship with
success: if the number of employed people in Portugal decreases, people are more likely to buy
long-term deposits. A possible explanation is that people feel more insecure when the
unemployment rate is high, so they are more willing to put money in the bank.
Decision Tree Model
When building the decision tree model, we tried not to overfit and to make the splits reflect
business sense. The first split is on euribor3m. If euribor3m < 1.266, there was a 37% chance
of buying the long-term bank deposit; this group accounted for 284 of the 2000 people. If
euribor3m > 1.266, we split further on contact communication type: there was a 7% chance of
buying when the contact was by cell phone, compared to only 4% by telephone. The third split,
under the cell phone group, is on job type. For retired, unemployed, student, admin.,
blue-collar, and entrepreneur clients there was a 9% chance of buying, whereas the other job
groups had only a 2% chance. The final split, under the group containing retired, unemployed,
etc., is on age at 50: clients older than 50 had a 13% chance of buying.
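The splits described above can be written out as a small rule function. This is a sketch of the reported tree, not an export from JMP: the function name is hypothetical, a value of exactly 1.266 is sent to the right-hand branch as an assumption, and the rate for clients aged 50 or younger in the last split is not reported, so the sketch falls back to the 9% rate of the parent node.

```python
# Job types that form the higher-propensity group in the third split.
HIGH_PROPENSITY_JOBS = {
    "retired", "unemployed", "student", "admin.", "blue-collar", "entrepreneur",
}


def node_success_rate(euribor3m, contact, job, age):
    """Return the reported chance of buying for the node a client falls into."""
    # First split: 3-month Euribor rate.
    if euribor3m < 1.266:
        return 0.37
    # Second split: contact communication type.
    if contact == "telephone":
        return 0.04
    # Third split (cell phone group): job type.
    if job not in HIGH_PROPENSITY_JOBS:
        return 0.02
    # Final split: age at 50.
    if age > 50:
        return 0.13
    # Rate for age <= 50 is not reported; use the parent node's 9% (assumption).
    return 0.09
```

For example, a retired client aged 60 who is contacted by cell phone while euribor3m is above 1.266 lands in the 13% node.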
Neural Network Model
Training Data
As shown in the Appendix (under Neural Network A.), the generalized RSquare with the variables
we used is 0.3422. Although the RSquare would ideally be closer to 1, the team believes this is
a strong value considering the number of variables used and the setting in which the model is
applied. The data set contains few quantitative variables, which typically makes it harder to
find a correlation. It carries a lot of demographic data that, although just as valuable, is
often harder to use in a neural network. We also found that in the testing data, the ability to
separate the yes's from the no's was fairly accurate. Out of the 2000 members of the testing
data, 68 were presented as success, which equated to a success rate of 32%.
The lift curve of the team's model shows that if 10% of the targeted audience is contacted
using this model, the lift is 5 times that of the standard procedure of using no model. In
other words, we reach five times as many successful customers if we contact the top 10% of our
audience, three times as many if we reach out to the top 20%, etc. In the graph, the red line
is the success rate when no model is used; the blue line is the lift when the model is used.
The value of this predictive model is that it lets us contact our audience in order of
predicted success, giving us the highest rate of success.
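As a rough illustration of how a lift value like the ones above is obtained, the sketch below ranks clients by model score and compares the success rate in the top fraction with the overall success rate. The function name and the toy data are hypothetical and are not taken from the campaign data set.

```python
def lift_at(y_true, scores, fraction):
    """Lift of the top `fraction` of clients when ranked by descending score."""
    n = len(y_true)
    k = max(1, round(n * fraction))
    # Indices of clients sorted from highest to lowest predicted score.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    top_rate = sum(y_true[i] for i in order[:k]) / k
    base_rate = sum(y_true) / n
    return top_rate / base_rate


# Toy example: 10 clients, 2 successes, both ranked at the top by the model.
toy_y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
toy_lift = lift_at(toy_y, toy_scores, 0.2)
```

In the toy data the top 20% contains only successes while the overall success rate is 20%, so the lift at 20% is 5; contacting everyone always gives a lift of 1.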
The ROC curve was constructed by computing the sensitivity and specificity as an increasing
share of the audience is selected, relative to the successes within that selected audience.
The area under the curve measures the model's ability to classify correctly those who will sign
up with the Portuguese bank and those who will not. The receiver operating characteristic curve
shows two important quantities: the rate at which the model identifies true positives and the
rate at which it produces false positives. Given that an area of 1 is perfect and 0.5 is
useless, the area under the curve of 0.8421 for our model indicates that the neural network is
a good classifier.
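One way to make the area under the ROC curve concrete is its pairwise interpretation: it equals the probability that a randomly chosen success receives a higher score than a randomly chosen failure, with ties counted as one half. The sketch below computes it that way on hypothetical toy data; it is not how JMP produced the 0.8421 figure, only an equivalent definition.

```python
def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # success ranked above failure
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy example: one of the two successes is ranked below a failure.
toy_auc = auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
```

A model that ranks every success above every failure gets an area of 1, and a model that gives everyone the same score gets 0.5.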
Testing Data
The testing data gives results very similar to those of the training data, which tells us that
the model generalizes well. The model identified 133 true positives in the testing data set.
Although no dollar amount is attached to the successes and to the clients the telemarketers
were unable to enroll, the model shows an accuracy of 85.35%, i.e. the share of test records
that it classifies correctly. According to the confusion-matrix metrics, the model shows
satisfactory ratios, with a true positive rate of 56.84% and a false positive rate of 10.87%,
which shows that this is a strong model.
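The three figures above all follow from one confusion matrix. The sketch below shows the formulas. Only the 133 true positives are reported in the text; the other three counts are back-computed here from the stated rates under the assumption of 2000 test records, so they should be read as a reconstruction rather than as reported results.

```python
def classification_rates(tp, fn, fp, tn):
    """Accuracy, true positive rate and false positive rate from raw counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,   # share of all records classified correctly
        "tpr": tp / (tp + fn),           # share of actual successes that were found
        "fpr": fp / (fp + tn),           # share of actual failures flagged as success
    }


# tp is reported; fn, fp and tn are reconstructed from the stated rates
# assuming 2000 test records (assumption, not reported in the text).
metrics = classification_rates(tp=133, fn=101, fp=192, tn=1574)
```

With these counts the function reproduces the 85.35% accuracy, the 56.84% true positive rate and the 10.87% false positive rate quoted above.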
Finalized Model and Business Analysis
To further improve our prediction, we combine all three models using a regression method.
Each of the three models outputs a probability of success (1). We turn each probability into a
predicted result using a cutoff value of 15%, which gives us the prediction results of all the
models, as shown below.
Y  Neural Network  Decision Tree  Logistic Regression
0  0               0              0
0  0               0              0
0  0               0              0
1  1               0              0
0  0               0              0
0  1               1              1
0  1               1              1
0  0               0              0
Result of Prediction
With all the prediction results in hand, we want to use the information from all three
prediction methods. We therefore ran a regression analysis to obtain a weight for each model.
We use the regression coefficients as the models' weights and compute a combined probability
of success. The result shows that the neural network has the largest share in the final model.
Model                Parameter
Neural Network       0.67
Decision Tree        0.21
Logistic Regression  0.11
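A minimal sketch of how such weights could be applied is given below. It assumes the combined probability is a plain weighted sum of the three model probabilities with no intercept, which the report does not confirm; the function names and the example probabilities are hypothetical. The 18% cutoff is the one listed for the combined model in the profit table.

```python
# Weights from the parameter table above.
WEIGHTS = {
    "neural_network": 0.67,
    "decision_tree": 0.21,
    "logistic_regression": 0.11,
}


def combined_probability(p_nn, p_dt, p_lr):
    """Weighted sum of the three model probabilities (no intercept: assumption)."""
    return (
        WEIGHTS["neural_network"] * p_nn
        + WEIGHTS["decision_tree"] * p_dt
        + WEIGHTS["logistic_regression"] * p_lr
    )


def predict(probability, cutoff):
    """Predict success (1) when the probability reaches the cutoff, else 0."""
    return 1 if probability >= cutoff else 0


# Hypothetical client: the neural network is fairly confident, the others are not.
p = combined_probability(p_nn=0.30, p_dt=0.10, p_lr=0.10)
decision = predict(p, cutoff=0.18)
```

Because the neural network carries about two thirds of the weight, a client it scores highly can pass the cutoff even when the other two models score the client low.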
It is worth mentioning that the team that previously worked on this data analyzed 52,944
records, collected from the Portuguese bank from 2008 to 2013. When the previous team began
analyzing the data, there was an initial set of 150 inputs that are commonly used in the
banking industry for predictive analytics. They used logistic regression, decision tree, neural
network, and support vector machines (which we did not use). The previous team was able to
narrow the variables down to 22 relevant features. For their models, they compared two critical
metrics: 1.) AUC, where their result was 0.80, and 2.) Lift, which revealed that 79% of the
successful sales could be achieved by contacting only half of the clients.
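The second metric is a cumulative gain: the share of all successes that is captured when only the top-ranked fraction of clients is contacted. The sketch below shows the calculation on hypothetical toy data; the function name is an assumption introduced here, and the data are not the previous team's.

```python
def cumulative_gain(y_true, scores, fraction):
    """Share of all successes found in the top `fraction` of ranked clients."""
    n = len(y_true)
    k = max(1, round(n * fraction))
    # Indices of clients sorted from highest to lowest predicted score.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    captured = sum(y_true[i] for i in order[:k])
    return captured / sum(y_true)


# Toy example: 8 clients, 3 successes, 2 of them ranked in the top half.
toy_gain = cumulative_gain(
    [1, 0, 1, 0, 0, 1, 0, 0],
    [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2],
    0.5,
)
```

In the toy data, contacting the top half of the ranked clients captures two of the three successes, i.e. about 67%.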
With our best model, the neural network, we achieved an AUC of 0.84 and a lift where about 85%
of the successful sales could be found with the model. Although we were able to beat the model
of our predecessors, a couple of factors in our analysis may have enabled us to do so: 1.)
Instead of 52,944 records, our team analyzed 4,000 records, using 2,000 as training data and
2,000 as testing data. 2.) The period of the initial analysis (2008-2013) saw a severe global
economic contraction in nearly all modern economies, which could have lowered the predecessors'
results for lift, accuracy, and the number of true positives.
Conclusion and Implication
In the telemarketing industry, optimizing the targeted audience is a key driver of sales
success. More specifically, the banking industry has been under increasing pressure to raise
profits and become more efficient since the 2008 financial crisis. Because of the crisis,
Portuguese banks were further pressured to increase their capital reserves, which is largely
why these data-driven models are such an important tool for capturing this specific audience.
The more bank accounts, CDs, and savings accounts the Portuguese bank can open by selecting a
specific audience rather than paying for blanket targeting, the greater the bank's reserves,
the more money the bank will be allowed to loan to customers, and the greater its ability to
increase profits conservatively.
Appendix
Data Preview
Variables
1 - Age (numeric)
2 - Job: type of job (categorical: "admin.","blue-collar","entrepreneur","housemaid",
"management","retired","self-employed","services","student","technician","unemployed","unknown")
3 - Marital : marital status (categorical: "divorced","married","single","unknown"; note:
"divorced" means divorced or widowed)
4 - Education (categorical: "basic.4y","basic.6y","basic.9y","high.school","illiterate",
"professional.course","university.degree","unknown")
5 - Default: has credit in default? (categorical: "no","yes","unknown")
6 - Housing: has housing loan? (categorical: "no","yes","unknown")
7 - Loan: has personal loan? (categorical: "no","yes","unknown")
# related with the last contact of the current campaign:
8 - Contact: contact communication type (categorical: "cellular","telephone")
9 - Month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
10 - Day_of_week: last contact day of the week (categorical: "mon","tue","wed","thu","fri")
11 - Duration: last contact duration, in seconds (numeric). Important note: this attribute
highly affects the output target (e.g., if duration=0 then y="no"). Yet, the duration is not known
before a call is performed. Also, after the end of the call y is obviously known. Thus, this
input should only be included for benchmark purposes and should be discarded if the
intention is to have a realistic predictive model.
# Other attributes:
12 - Campaign: number of contacts performed during this campaign and for this client
(numeric, includes last contact)
13 - Pdays: number of days that passed by after the client was last contacted from a
previous campaign (numeric; 999 means client was not previously contacted)
14 - Previous: number of contacts performed before this campaign and for this client
(numeric)
15 - Poutcome: outcome of the previous marketing campaign (categorical:
"failure","nonexistent","success")
External Variables
# Social and economic context attributes
16 - Emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - Cons.price.idx: consumer price index - monthly indicator (numeric)
18 - Cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - Euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - Nr.employed: number of employees - quarterly indicator (numeric)