Akanksha final case_a_bank revenues

Final Case A: Bank Revenues
Quest for the best model that predicts profitability for a given customer
Akanksha Sinha 10/20/19 Introduction to Data Mining

Final Case A: Bank Revenues - Predictive Statistical Analysis
Akanksha Sinha © 2019
Profitability Prediction Statistical Analysis
Assignment Name: Final Case A | Dataset: Bank Revenues
DSS 660 | Fall 2019 | Under supervision of Prof. Dr. Ronald Klimberg
Submitted by Akanksha Sinha

Contents
Overview
Objective
Data
Stage 1: Descriptive Statistics
Initial Findings
Taking Logs of Rev_Total & Bal_Total
Correlations
Treating Missing Data & Outliers
Stage 2: Running Models
Section 1: Find a good/“best” model using multiple linear regression
Section 1.1: Multiple Linear Regression using all variables
Section 1.2: Model Improvement Steps
Section 2: Find a good/“best” model using Decision Tree
Section 2.1: Find a good/“best” model using Decision Tree without the Log Values
Section 2.2: Find a good/“best” model using Decision Tree with Log values for Rev_Total & Bal_Total
Section 3: Find a good/“best” model using ANOVA
Section 3.1: One-way ANOVA
Section 3.2: Two-way ANOVA
Section 3.3: Evaluating model for equal variances
Section 4: Running a Principal Component Analysis
Section 5: LASSO
Section 6: Running Clustering
Section 7: Market Basket Analysis
Section 8: Find a good/“best” model using multiple linear regression with the 5 principal components
Section 9: Find a good/“best” model using multiple linear regression with the clusters
Section 9.1: Model Improvement – Stepwise
Section 10: Find a good/“best” model using multiple linear regression with the clusters and the original data
Section 10.1: Model Improvement – Stepwise
Section 10.2: Model Improvement – Removing Variables
Section 11: Find a good/“best” model using multiple linear regression with the clusters and the 5 components
Section 11.1: Model Improvement – Stepwise
Section 11.2: Model Improvement – Removing Variables
Stage 3: Model Comparison
Section 1: Creating a validation column
Section 2: Saving Prediction Formulas for each model
Section 3: Performing Model Comparison
Model Comparison Summary
References

Overview
Bank management wants to understand how their customers’ banking habits contribute to revenues
and profitability. The data set BankRevnue.jmp has banking information on 7420 customers.
The variables in the data set are listed below. Management would like to have a model to forecast
bank revenues to guide them in future marketing campaigns. A surrogate for a customer’s
profitability that is available in the data set is the variable Rev_Total, which is the total revenue a
customer generates through their accounts and transactions.
Objective:
In this paper, we will perform various statistical analyses to determine the best model that
predicts profitability for a given customer. A 360° approach is followed to perform, compare &
choose the best statistical analysis. We perform our analysis in 3 major stages. In stage 1 we
compute descriptive statistics, note correlations, and explore missing values & outliers. In stage 2
we run a regression to establish a baseline, then ANOVA, and then bring in other tools such as
PCA, Clustering & Decision Tree. Within each model we try to enhance the model further by
performing stepwise selection, removing the least contributing variables, and/or creating new
variables. In stage 3 we compare our models and choose the best one on which to base our
predictions.
Data:
There are 16 variables in the dataset.

Stage 1: Descriptive Statistics
Descriptive statistics are used to describe the basic features of the data in a study. They provide
simple summaries about the sample and the measures. Together with simple graphics analysis,
they form the basis of virtually every quantitative analysis of data. (1)
Initial Findings:
Let’s inspect the data before running the various analyses.
1) Rev_Total: The available data for Rev_Total are not normally distributed, as is evident
from the graph and the p-value of less than 0.05.
2) Bal_Total: The available data for Bal_Total are not normally distributed, as is evident
from the graph and the p-value of less than 0.05.
3) Offer: 4063 members received the promotional offer, so a customer is more likely to
have received the offer than not.
4) CHQ: The frequencies for debit card account activity indicate an almost equal
probability of low (or zero) versus greater account activity.
5) CARD: Credit card activity likewise shows an equal chance of low (or zero) versus
greater account activity.
6) SAV 1: The primary savings account activity record suggests that more than 80% of
customers have low or zero account activity & almost 19% of members have greater
account activity.
7) LOAN: The loan account activity record suggests that more than 88% of customers have
low or zero account activity & almost 11% of members have greater account activity.
8) MORT: The data for mortgage suggest that 63% of customers are lower tier/less
important and 37% of customers are higher tier/more important to the bank’s portfolio.
9) INSUR: The insurance account activity record suggests that almost 70% of customers
have low or zero account activity & more than 30% of members have greater account
activity.
10) PENS: The data for retirement savings suggest that 48% of customers are lower tier/less
important and 52% of customers are higher tier/more important to the bank’s portfolio.
11) Check: The checking account activity record suggests that more than 22% of customers
have low or zero account activity & almost 78% of members have greater account activity.
12) CD: The data for certificate of deposit account tier suggest that 88% of customers are
lower tier/less important and 11% of customers are higher tier/more important to the
bank’s portfolio.
13) MM: The money market account activity record suggests that more than 70% of
customers have low or zero account activity & almost 30% of members have greater
account activity.
14) AccountAge: AccountAge denotes the number of years as a customer of the bank. On
average AccountAge is 7 years, with a minimum of 0 and a maximum of 26.

Note: There are 7420 records in the Bank Revenue Dataset. The above table shows the statistics for some
of the important variables.
Taking Logs of Rev_Total & Bal_Total
Since Rev_Total & Bal_Total were not normally distributed, we are creating new columns by taking
the logs of both columns.
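As a minimal sketch of this transformation outside JMP, the Python snippet below takes the natural log of a skewed column and compares the skewness before and after. The values are made up for illustration (they are not taken from the bank dataset), and the choice of the natural log rather than base 10 is an assumption.

```python
import math

def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up, right-skewed revenue values (hypothetical, not from the dataset).
rev_total = [0.5, 2.0, 15.0, 120.0, 2500.0]

# Natural log of each value; this requires strictly positive inputs.
log_rev_total = [math.log(v) for v in rev_total]

skew_raw = skewness(rev_total)
skew_log = skewness(log_rev_total)
print(f"skewness before: {skew_raw:.2f}, after: {skew_log:.2f}")
```

A skewness closer to zero after the transformation is one sign that the logged column is more symmetric, although it does not by itself establish normality.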

Correlations
Correlation is a measure of the linear association between two variables.
In the figure below, the correlations among some of the most important variables suggest that
Bal_Total is more strongly associated than the other variables with the revenue generated by the
customer over a 6-month period. The pairwise correlations likewise show a positive correlation
between Bal_Total and Rev_Total.
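To make the measure concrete, here is a small sketch of the Pearson correlation coefficient computed by hand in Python. The two short lists are hypothetical stand-ins for Bal_Total and Rev_Total, not values from the dataset, and `pearson_r` is an illustrative helper name.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by the product of the spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical balances and revenues that rise together.
bal_total = [1.0, 2.0, 3.0, 4.0, 5.0]
rev_total = [1.2, 1.9, 3.2, 3.8, 5.1]

r = pearson_r(bal_total, rev_total)
print(f"r = {r:.3f}")
```

A value near +1 indicates a strong positive linear association, a value near -1 a strong negative one, and a value near 0 little linear association.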
Treating Missing Data & Outliers
Note: No missing values were found.
Note: The outlier analysis did not identify any values significant enough to consider removing.

Stage 2: Running Models
In the second stage we will run different combinations of models, such as Multiple Linear Regression,
ANOVA, Decision Tree, Principal Components, LASSO, Clustering & Market Basket. Some of the
selected statistical analyses may not be applicable to the given dataset. We will use our best
judgement, perform the analyses, and combine the newly created variables to run the Multiple
Linear Regression. We will also save the validation column as well as the prediction formula
columns so that we can perform the model comparison analysis in stage 3 of this paper.
Section 1: Find a good/“best” model using multiple linear regression
Multiple linear regression attempts to model the relationship between two or more
explanatory variables and a response variable by fitting a linear equation to observed
data. Every value of the independent variable $x$ is associated with a value of the
dependent variable $y$. The population regression line for $p$ explanatory variables
$x_1, x_2, \ldots, x_p$ is defined to be
$\mu_y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$. This line describes how the
mean response $\mu_y$ changes with the explanatory variables. The observed values
for $y$ vary about their means $\mu_y$ and are assumed to have the same standard
deviation $\sigma$. The fitted values $b_0, b_1, \ldots, b_p$ estimate the parameters
$\beta_0, \beta_1, \ldots, \beta_p$ of the population regression line.
Since the observed values for $y$ vary about their means $\mu_y$, the multiple regression
model includes a term for this variation. In words, the model is expressed as
DATA = FIT + RESIDUAL,
where the "FIT" term represents the expression
$\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$. The
"RESIDUAL" term represents the deviations of the observed values $y$ from their
means $\mu_y$, which are normally distributed with mean 0 and variance $\sigma^2$. The
notation for the model deviations is $\varepsilon$.
Formally, the model for multiple linear regression, given $n$ observations, is
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i$ for $i = 1, 2, \ldots, n$. (5)
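The fitted values b0, b1, ..., bp can be obtained by least squares. The sketch below is a minimal pure-Python illustration that solves the normal equations (X'X)b = X'y by Gauss-Jordan elimination. It is not the procedure JMP uses internally; the function name `fit_ols` and the tiny synthetic data set are assumptions made for illustration.

```python
def fit_ols(rows, y):
    """Least-squares estimates [b0, b1, ..., bp] from the normal equations."""
    # Prepend a column of ones so that b0 acts as the intercept.
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * yi for x, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Synthetic data generated exactly from y = 1 + 2*x1 + 3*x2 (no noise).
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

coefficients = fit_ols(rows, y)
print([round(b, 6) for b in coefficients])
```

Because the synthetic response contains no residual term, the estimates recover the generating coefficients up to rounding; with real data the residuals absorb the remaining variation.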

Section 1.1: Multiple Linear Regression using all variables
In this section, we will perform the multiple linear regression using the dependent
variable ‘Rev_Total’ along with all the other independent variables. We will also
examine the Effect Summary report, Summary of Fit, and Parameter Estimates, and adjust the
model.
1) In Fig 1.1 the Effect Summary report lists the LogWorth or False Discovery Rate (FDR)
LogWorth values in ascending p-value order. These statistics measure the
effects of the independent variables in the model. A LogWorth value greater than 2
corresponds to a p-value of less than .01. The FDR LogWorth is a better statistic for
assessing significance since it adjusts the p-values to account for the false discovery rate
from multiple tests.
In Fig 1.1 we can notice that the LogWorth values for the top 4 variables, Log Bal_Total,
CARD, Check & Offer, are above 2 and their p-values are below .01.
Fig 1.1

2) Evaluating the statistical significance of the model:
Fig 1.2
• In the above figure we can notice that, according to the Parameter Estimates table, the
multiple linear regression equation is as follows:
Rev_Total = -2.53 + 0.4421 * Bal_Total - 0.06 * Offer - 0.00057 * Age - 0.004403 * CHQ
- 0.783241 * CARD + 0.01095 * SAV1 + 0.0587 * Loan + 0.01605 * Mort - 0.03717 *
INSUR - 0.000349 * PENS + 0.686 * Check
• In Fig 1.2 we can see that the p-value for the F test is <0.0001. So we reject the null
hypothesis of the F test and can conclude that one or more of the independent variables
(that is, one or more of the variables Bal_total, Offer, Age, Chq, Card, Sav1, Loan, Mort,
Insur, Pens, Check, CD, MM, Savings, and AccountAge) is significantly related to
Rev_Total.
• T test for each independent variable: We can see in Fig 1.2 above that
Bal_Total, Offer, Card, Loan, Insur and Check each reject H0 and are
significantly related to Rev_Total above and beyond the other independent variables.
• Multicollinearity (VIF): Most of the independent variables show no
significant multicollinearity, as their VIFs fall below 5. Card and Check have high
VIFs (>5), which suggests the presence of multicollinearity.
• Let’s examine the multiple linear regression further to obtain the baseline values.
We find an adjusted R² = 0.5988 and an RMSE = 0.8105.
RMSE is a measure of the average deviation of the estimates from the observed
values, that is, the square root of the variance of the residuals. R², in contrast, is the
fraction of the total sum of squares that is explained by the regression.
As we need to do forecasting, we should consider RMSE here.
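The relationship between these fit statistics can be sketched in a few lines of Python. The numbers below are made up for illustration, `fit_stats` is a hypothetical helper, and the RMSE uses the error degrees of freedom n - p - 1, which is an assumption about how the reported value is computed.

```python
import math

def fit_stats(y, y_hat, p):
    """Return (R², adjusted R², RMSE) for a model with p predictors."""
    n = len(y)
    mean_y = sum(y) / n
    sse = sum((a - b) ** 2 for a, b in zip(y, y_hat))  # residual sum of squares
    sst = sum((a - mean_y) ** 2 for a in y)            # total sum of squares
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    rmse = math.sqrt(sse / (n - p - 1))
    return r2, adj_r2, rmse

# Hypothetical observed and predicted values from a one-predictor model.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]

r2, adj_r2, rmse = fit_stats(observed, predicted, p=1)
print(f"R2 = {r2:.4f}, adjusted R2 = {adj_r2:.4f}, RMSE = {rmse:.4f}")
```

RMSE is expressed in the units of the response, which is why it is the more natural yardstick when the goal is forecasting, while adjusted R² penalizes R² for each additional predictor.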
Section 1.2: Model Improvement Steps
Stepwise Regression
Fig 1.3
In the figure above we can notice that the RMSE has not changed much and the adjusted R² has
decreased. We will attempt to improve the model further by removing the least contributing
variables. The VIF is still greater than 5 for Card.
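For reference, the VIF of a predictor is 1 / (1 - R²_j), where R²_j is the R² obtained by regressing that predictor on all the other predictors. The short sketch below only illustrates this formula; the R²_j values are hypothetical and `vif` is an illustrative helper name.

```python
def vif(r_squared_j):
    """Variance inflation factor, given R² of predictor j on the other predictors."""
    return 1.0 / (1.0 - r_squared_j)

# Hypothetical R²_j values, showing where the VIF > 5 rule of thumb kicks in.
for r2_j in (0.0, 0.5, 0.8, 0.9):
    print(f"R2_j = {r2_j:.1f} -> VIF = {vif(r2_j):.2f}")
```

Under this formula, the VIF > 5 cut-off used here corresponds to a predictor whose variation is more than 80% explained by the other predictors.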

Excluding Variables
Fig 1.4
Removing the non-significant columns has not significantly changed the R² & RMSE.
Section 2: Find a good/“best” model using Decision Tree

A decision tree is a hierarchical collection of rules that specify how a data set is to be broken up
into smaller groups based on a target variable (dependent y variable). If the target variable is
categorical, then the decision tree is called a classification tree. If the target variable is
continuous, then the decision tree is called a regression tree.
Our dataset has mostly categorical data.
JMP automatically chooses the variable and split that maximize the LogWorth statistic.
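To illustrate how a regression tree searches for a split, the sketch below scans every candidate threshold of one continuous predictor and keeps the one that most reduces the sum of squared errors. Note that this uses SSE reduction as a simpler stand-in for the LogWorth criterion that JMP maximizes; the data and the function names are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, sse_reduction) for the best binary split on x."""
    pairs = sorted(zip(x, y))
    total = sse([target for _, target in pairs])
    best_threshold, best_gain = None, 0.0
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate identical x values
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        gain = total - sse(left) - sse(right)
        if gain > best_gain:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_gain = gain
    return best_threshold, best_gain

# Hypothetical balances and revenues with an obvious jump between 1.5 and 3.
balance = [0.5, 0.8, 1.5, 3.0, 4.0, 6.0]
revenue = [1.0, 1.2, 1.1, 5.0, 5.2, 4.9]

threshold, gain = best_split(balance, revenue)
print(f"split at balance < {threshold}, SSE reduction = {gain:.3f}")
```

Growing a tree repeats this search within each resulting node, and pruning removes the splits that contribute least.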
Section 2.1: Find a good/“best” model using Decision Tree without the Log Values
Fig 2.1
In the above figure we can notice that the splits were made on Bal_Total. We repeatedly split the
tree at the best split and pruned the worst split.

Fig 2.2
In the above figure we can notice that the node for Bal_Total<$2 has been split into a leaf for
Bal_Total<$1. We will perform a couple more splits to observe the changes in the LogWorth
values.
Note: The LogWorth statistic is used for pruning or growing a tree. It is defined as
-log10(p-value). Typically, if the LogWorth is greater than 2, then the variable that is used in the
branch is significant and should be included in the tree.
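The link between the LogWorth threshold of 2 and a p-value of .01 follows directly from the definition, as this short Python sketch shows. The p-values used are arbitrary examples, not output from the analysis.

```python
import math

def logworth(p_value):
    """LogWorth as defined above: the negative base-10 log of the p-value."""
    return -math.log10(p_value)

# Arbitrary example p-values around the usual cut-offs.
for p in (0.05, 0.01, 0.001):
    print(f"p = {p} -> LogWorth = {logworth(p):.2f}")
```

Smaller p-values therefore map to larger LogWorth values, and any split with a LogWorth above 2 has a p-value below .01.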

Fig 2.3
In the above figure we can notice that the LogWorth value for Bal_Total<1 is 2.84 and in the next split
the LogWorth is 1.21, and finally after PENS(1) the tree could not be split further. Hence we will prune
all the splits with a LogWorth below 2.
Fig 2.4

In the above figure, we can notice that the Bal_Total variable is the sole contributor.
Fig 2.4
In the above figure, we can notice that the Leaf report shows the mean and counts of the bottom-
level leaves.
Note: We have saved the prediction formula column to our dataset for model comparison in
stage 3.
Section 2.2: Find a good/“best” model using Decision Tree with Log values for Rev_Total & Bal_Total

In this section we can notice that taking logs improves the R², although the RMSE also
increases.
Section 3: Find a good/“best” model using ANOVA
Analysis of Variance (ANOVA) is a statistical method used to test differences between two or
more means. As the name suggests inferences about means are made by analyzing
variance. ANOVA is used to test general rather than specific differences among means.
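The idea of comparing means by analyzing variance can be sketched with a hand-computed one-way F statistic. The three small groups below are hypothetical, and `one_way_f` is an illustrative helper rather than the routine JMP runs.

```python
def one_way_f(groups):
    """F statistic: between-group mean square over within-group mean square."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    ss_within = sum(
        sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups
    )
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical revenue values for three levels of a nominal factor.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_f(groups)
print(f"F = {f_stat:.2f}")
```

A large F statistic means the group means differ by more than the within-group scatter would explain, which is what leads to rejecting the null hypothesis of equal means.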
Section 3.1: One-way ANOVA
Performing a One-way ANOVA: In the previous sections we found that Bal_Total was the
most contributing factor in Rev_Total. Hence for the one-way ANOVA we changed the modeling
type of Bal_Total to nominal and observed the results below:

Fig 3.1
Fig 3.2
In the above figure we can notice that the R² is 0.92, the adjusted R² is 0.65 and the RMSE is 1.89. The
F test here suggests that the null hypothesis is rejected since the p-value is <0.01.
There is not much more to do here, since JMP is not showing results for Levene’s, Welch’s, or
Tukey’s HSD tests.

Fig 3.3 Performing a one-way ANOVA on Card & Offer
As in Section 1 we noticed the significant impact of the nominal variables Card & Offer, we are further
analyzing their impact.
Here also we will reject the null hypothesis for the F test.
Section 3.2: Two-way ANOVA
Performing a two-way ANOVA on Card & Offer

We can see here that the LogWorth value of Card is 1.783, which is considerably higher than the others.

Here we can notice from the connecting letters report that levels not connected by the same letter are
significantly different.
Section 3.3: Evaluating model for equal variances
We attempted to evaluate the model by adding a new column for the concatenated Card & Offer values,
though the one-way ANOVA would not run on it because the data type was not compatible.

Section 4: Running a Principal Component Analysis
Principal Component Analysis (PCA) is an exploratory multivariate technique with two
overall objectives. The first objective is “dimension reduction”, that is, reducing several variables
to a few with a minimum loss of information, and the second objective is to “discover the structure
in the relationships between the variables”.
Fig 4.1
Fig 4.2
The first eigenvalue, 4.0061, is larger than the second eigenvalue, 2.9642. This suggests that the
first principal component, Prin1, is more important (in terms of explaining the variation in
the variables) than the second principal component, Prin2.
Also, the percentage of variation accounted for by Prin 1 is about 25.038% & by Prin 2 is
about 18.52%. The first five principal components have a cumulative percentage of 72.65%.
The points in the scatterplot below cluster in different quadrants, which suggests groupings
among the observations (the principal components themselves are uncorrelated by construction).
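The reported percentages can be reproduced from the eigenvalues with simple arithmetic. The sketch below assumes the PCA was run on the correlation matrix of 16 variables, so that the eigenvalues sum to 16; this is consistent with the reported percentages but is not stated explicitly in the text.

```python
# Assumption: PCA on correlations of 16 standardized variables,
# so the total variance (sum of all eigenvalues) equals 16.
n_variables = 16

# The two eigenvalues quoted in the text.
eigenvalues = {"Prin1": 4.0061, "Prin2": 2.9642}

# Share of total variation explained by each component, in percent.
percent = {name: 100 * value / n_variables for name, value in eigenvalues.items()}

for name, share in percent.items():
    print(f"{name}: {share:.3f}% of total variation")
```

The cumulative percentage for five components would be obtained the same way, by summing the first five eigenvalues before dividing; the remaining eigenvalues are not listed in the text, so they are not reproduced here.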

In the Loading plot the variables fall into nearly 4 groups.
In Fig 4.2 the Eigenvectors table highlights in darker shades the values that carry the most
weight and may have an impact on Revenue. The scree plot suggests that Prin 1 is highly
important, and the elbow at Prin 5 suggests that we should consider five principal components
for our analysis.
Fig 4.3
Fig 4.4

Fig 4.5
After inspecting the rotated factor loadings, we found that the results are similar to the score plot &
eigenvectors, as Insur, MM & Savings are the main driving factors of principal component
1.
Section 5: LASSO
The Lasso is a shrinkage and selection method for linear regression. It minimizes the usual sum
of squared errors, with a bound on the sum of the absolute values of the coefficients. It has
connections to soft-thresholding of wavelet coefficients, forward stagewise regression, and
boosting methods.
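The soft-thresholding connection mentioned above can be shown in a few lines. In the special case of orthonormal predictors and the penalized objective 0.5 * SSE + lam * sum(|b|), the lasso estimate is the least-squares coefficient shrunk toward zero by lam and set to exactly zero when it is smaller than lam in absolute value. The sketch below illustrates only that operator; the coefficient values and the penalty are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; return exactly 0.0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Hypothetical least-squares coefficients and penalty value.
ols_coefficients = {"Loan": 3.0, "Card": -2.5, "CD": 0.4, "MM": -0.5}
lam = 1.0

lasso_coefficients = {
    name: soft_threshold(value, lam) for name, value in ols_coefficients.items()
}
print(lasso_coefficients)
```

This is why the lasso performs variable selection as well as shrinkage: small coefficients are driven exactly to zero and the corresponding variables drop out of the model.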
Fig 5.1

Fig 5.2
In the above figure we can notice that CD, MM & Savings have been removed. This is one of
the advantages of the LASSO: the variables often hit the zero axis in groups, enabling the
modeler to drop several variables at a time.
From the effect test above, the most important variables are Loan, Bal_Total, Card & Insur.
Section 6: Running Clustering
Cluster analysis is an exploratory multivariate technique designed to uncover natural groupings
of the rows in a data set. Clustering is helpful when we have more than 2 dimensions in a dataset.
Cluster analysis is a technique where no dependence in any of the variables is required. The
objective of the cluster analysis is to divide the data set into groups, where the observations
within each group are relatively homogeneous, and yet the groups are different than each other.
The initial dendrogram defaults to five clusters and the scree plot appears to have an elbow
somewhere between the fourth and fifth point. We examined the clusters briefly to obtain a
baseline for comparison.

Hierarchical Clustering
Fig 6.1
Fig 6.2
The initial dendrogram defaults to four clusters. We examined the clusters briefly to obtain a
baseline for comparison. Cluster 4 has the most observations i.e. 2816.
Cluster groupings are described in greater detail in the K Means section below, since the
hierarchical analysis is meant for initial inspection.

K Means
We also conducted K Means for 6 clusters. The standard biplot shows overlapping clusters, which
a closer look at the 3D biplot confirmed.
Fig 6.3
Fig 6.4
The K Means output shows that Cluster 1 could be an outlier, with only one observation. The
clusters can be described as follows:
Cluster 1: Only 1 observation with the highest Rev_Total
Cluster 2: This cluster has highest Bal_Total
Cluster 3: It has 813 records

Cluster 5: It has the highest number of observations with 2801
The K means method is intended for use with larger data tables, from approximately 200 to
100,000 observations and the result can be highly sensitive to the order of the observations in the
data table.
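As a rough illustration of the K Means procedure, the sketch below alternates between assigning each point to its nearest center and moving each center to the mean of its points. It is a minimal version with hand-picked starting centers and made-up two-dimensional points, not the JMP implementation, and different starting centers can lead to different clusters.

```python
def kmeans(points, centers, iterations=10):
    """Basic K Means: assign points to the nearest center, then update centers."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Index of the center with the smallest squared Euclidean distance.
            nearest = min(
                range(len(centers)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            clusters[nearest].append(p)
        # Move each center to the mean of its members (keep it if empty).
        centers = [
            tuple(sum(dim) / len(members) for dim in zip(*members))
            if members else center
            for members, center in zip(clusters, centers)
        ]
    return centers, clusters

# Made-up points forming two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, centers=[(0, 0), (10, 10)])
print(centers)
print([len(c) for c in clusters])
```

On real data the groups are rarely this well separated, which is one reason the biplots can show overlapping clusters.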
Section 7: Market Basket Analysis
Market Basket Analysis is one of the key techniques used by large retailers to uncover
associations between items. It works by looking for combinations of items that occur together
frequently in transactions. To put it another way, it allows retailers to identify relationships
between the items that people buy.
Association Rules are widely used to analyze retail basket or transaction data, and are intended to
identify strong rules discovered in transaction data using measures of interestingness, based on
the concept of strong rules.
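The usual measures of interestingness for a rule A -> B are support, confidence and lift. The sketch below computes them for a handful of hypothetical baskets of banking products; the baskets are invented purely to show the arithmetic and do not come from the bank dataset.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    n = len(baskets)
    has_a = sum(1 for b in baskets if antecedent in b)
    has_c = sum(1 for b in baskets if consequent in b)
    has_both = sum(1 for b in baskets if antecedent in b and consequent in b)
    support = has_both / n             # share of baskets containing both items
    confidence = has_both / has_a      # share of antecedent baskets with consequent
    lift = confidence / (has_c / n)    # confidence relative to the consequent's base rate
    return support, confidence, lift

# Hypothetical baskets: each set lists the products one customer holds.
baskets = [
    {"CHQ", "CARD"},
    {"CHQ", "CARD", "LOAN"},
    {"CHQ"},
    {"CARD"},
    {"CHQ", "CARD"},
]

support, confidence, lift = rule_metrics(baskets, "CHQ", "CARD")
print(f"support = {support:.2f}, confidence = {confidence:.2f}, lift = {lift:.4f}")
```

A lift above 1 would indicate that the two items occur together more often than expected if they were independent, and a lift below 1 indicates the opposite.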
Note: Due to lack of Customer ID or any sort of ID it was difficult to run the Market Basket
Analysis.
Section 8: Find a good/“best” model using multiple linear regression with the 5 principal components
In this section, we will run the multiple linear regression with the 5 principal components that we
obtained in Section 4. Here we will examine the predicted plots, LogWorth, p-value, and R²
value.
The term leverage is used because these plots help you visualize the influence of points on the
test for including the effect in the model. A point that is horizontally distant from the center of
the plot exerts more influence on the effect test than does a point that is close to the center.
• First, we ran the multiple linear regression with the 5 components and obtained
R² = 0.661385 and RMSE = 1.866957.
• Then we ran a stepwise multiple linear regression with the 5 components and
noticed that it automatically removed Prin 2.
• Next, to improve the model further, we removed Prin1 and ran a stepwise multiple
linear regression with the remaining 4 components. As before, Prin2 was automatically removed.
We can see that all models thus far produce approximately the same results.

Section 9: Find a good/“best” model using multiple linear regression with the clusters
After running the Multiple Linear Regression with the clusters we obtained an R² of 0.223874 & an RMSE of
1.126958. Most of the clusters are statistically significant at the .05 significance level and almost
all the VIFs are greater than 5.

Section 9.1: Model Improvement – Stepwise
We will now rerun the model as a stepwise regression to improve it further.
In the above figure we can notice that the VIFs have now decreased and are all
around 5, though there is not much change in the R² and RMSE values.
Section 10: Find a good/“best” model using multiple linear regression with the clusters and the original data
In this model we ran the multiple linear regression with the clusters and the original data. We can notice
in the effect summary within Fig 9.1 that many of the variables, such as CARD and AGE, have LogWorth
values below 2.
The R² is 0.686 and the RMSE is 0.7172.
Within the parameter estimates we can notice that the VIFs for many variables are greater than 5, and
a few of the variables are statistically non-significant.

Fig 9.1

Section 10.1: Model Improvement – Stepwise
Fig 9.2
In the above figure we can notice that the statistically non-significant variables are removed
automatically, though there is not much change in the R² and the RMSE. The LogWorth is good for all the
selected variables. Many of the VIFs are still very high.

Section 10.2: Model Improvement – Removing Variables
We will remove INSUR and CD, as they have high VIFs. The R² has decreased and the RMSE has increased.
Fig 9.3

Section 11: Find a good/“best” model using multiple linear regression with the clusters and the 5 components
Fig 10.1

We ran the multiple linear regression with the clusters and the 5 components. We noticed that Prin1
has a LogWorth value below 2. The VIFs are high for the clusters & Prin3. We will attempt to
improve the model further.
Section 11.1: Model Improvement – Stepwise
Fig 10.2
Here we can notice that the R² value has decreased and the RMSE has increased, though the LogWorth value
for each variable is above 2. The VIFs for the clusters, Prin1 and Prin3 are greater than 5.

Section 11.2: Model Improvement – Removing Variables
Fig 10.3
Here we can notice that the R² value is decreased and the RMSE is increased though the
LogWorth value for each variable is above 2. VIF for clusters are greater than 5.

Stage 3: Model Comparison
Section 1: Creating a validation column
Fig 1
Section 2: Saving Prediction Formulas for each model
While running the 10 different combinations of models in stage 2, we saved the prediction formula
of each model so that we could compare the models.
Fig 2

Section 3: Performing Model Comparison
Fig 3
Fig 4
The above figure gives a snapshot of the model comparison by validation set and the prediction formulas.

Model Comparison Summary
In this section we summarize our statistical analysis. As is evident from the above summary table,
Section 10 is the best model, with R² = 0.686157, Adj. R² = 0.685352 and RMSE = 0.717263. An
attempt to improve the model further through stepwise regression and removal of non-significant
variables did not improve the R² or RMSE, though there was some improvement in the LogWorth values & VIFs.

References
1. Descriptive Statistics: https://socialresearchmethods.net/kb/statdesc.php
2. Hypothesis Testing: http://mathworld.wolfram.com/HypothesisTesting.html
3. Test concerning the mean of Normal Population: http://demonstrations.wolfram.com/ThePowerOfATestConcerningTheMeanOfANormalPopulation/
4. Assessing Normality: https://www.jmp.com/en_hk/learning-library/probabilities-and-distributions.html
5. Analysis Toolpak: https://support.office.com/en-us/article/use-the-analysis-toolpak-to-perform-complex-data-analysis-6c67ccf0-f4a9-487c-8dec-bdb5a2cefab6?NS=EXCEL&Version=90&SysLcid=1033&UiLcid=1033&AppVer=ZXL900&HelpId=xladdin.chm1786&ui=en-US&rs=en-US&ad=US
6. Correlation: https://www.jmp.com/content/dam/jmp/documents/en/academic/learning-library/05-correlation.pdf
7. RMSE & R²: https://www.quora.com/What-is-the-difference-between-RMSE-and-R-squared-in-statistics
8. ANOVA: http://onlinestatbook.com/2/analysis_of_variance/intro.html
9. LASSO: http://statweb.stanford.edu/~tibs/lasso.html
10. Market Basket Analysis: https://towardsdatascience.com/a-gentle-introduction-on-market-basket-analysis-association-rules-fa4b986a40ce
11. Market Basket Analysis: http://databoosting.com/boosting-revenue-market-basket-analysis/
