Final Case A: Bank Revenues
Quest for the best model that predicts profitability for a given customer
Akanksha Sinha 10/20/19 Introduction to Data Mining
Final Case A: Bank Revenues - Predictive Statistical Analysis
Akanksha Sinha © 2019
Profitability Prediction Statistical Analysis
Assignment Name: Final Case A| Dataset: Bank Revenues
DSS 660 | Fall 2019 | Under supervision of Prof. Dr. Ronald Klimberg
Submitted by Akanksha Sinha
Contents
Overview
  Objective
  Data
Stage 1: Descriptive Statistics
  Initial Findings
  Taking Logs of Rev_Total & Bal_Total
  Correlations
  Treating Missing Data & Outliers
Stage 2: Running Models
  Section 1: Find a good/“best” model using multiple linear regression
  Section 1.1: Multiple Linear Regression using all variables
  Section 1.2: Model Improvement Steps
  Section 2: Find a good/“best” model using Decision Tree
  Section 2.1: Find a good/“best” model using Decision Tree without the Log Values
  Section 2.2: Find a good/“best” model using Decision Tree with Log values for Rev_Total & Bal_Total
  Section 3: Find a good/“best” model using ANOVA
  Section 3.1: One-way ANOVA
  Section 3.2: Two-way ANOVA
  Section 3.3: Evaluating model for equal variances
  Section 4: Running a Principal Component Analysis
  Section 5: LASSO
  Section 6: Running Clustering
  Section 7: Market Basket Analysis
  Section 8: Find a good/“best” model using multiple linear regression with the 5 principal components
  Section 9: Find a good/“best” model using multiple linear regression with the clusters
  Section 9.1: Model Improvement – Stepwise
  Section 10: Find a good/“best” model using multiple linear regression with the clusters and the original data
  Section 10.1: Model Improvement – Stepwise
  Section 10.2: Model Improvement – Removing Variables
  Section 11: Find a good/“best” model using multiple linear regression with the clusters and the 5 components
  Section 11.1: Model Improvement – Stepwise
  Section 11.2: Model Improvement – Removing Variables
Stage 3: Model Comparison
  Section 1: Creating a validation column
  Section 2: Saving Prediction Formulas for each model
  Section 3: Performing Model Comparison
  Model Comparison Summary
References
Overview
Bank management wants to understand how their customers’ banking habits contribute to revenues
and profitability. The data set BankRevnue.jmp has banking information on 7420 customers.
The variables in the data set are listed below. Management would like to have a model to forecast
bank revenues to guide them in future marketing campaigns. A surrogate for a customer’s
profitability that is available in the data set is the variable Rev_Total, which is the total revenue a
customer generates through their accounts and transactions.
Objective:
In this paper, we perform various statistical analyses to determine the best model that
predicts profitability for a given customer. A 360° approach is followed to perform, compare &
choose the best statistical analysis. The analysis has 3 major stages. In stage 1 we compute
descriptive statistics, note correlations, and explore missing values & outliers. In stage 2 we run
a regression to establish a baseline, then ANOVA, and then bring in other tools such as PCA,
Clustering & Decision Tree. Within each model we try to enhance the model further by performing
stepwise selection, removing the least contributing variables, and/or creating new variables. In
stage 3 we compare our models and choose the best one on which to base our predictions.
Data:
There are 16 variables in the dataset.
Stage 1: Descriptive Statistics
Descriptive statistics are used to describe the basic features of the data in a study. They provide
simple summaries about the sample and the measures. Together with simple graphics analysis,
they form the basis of virtually every quantitative analysis of data. (1)
Initial Findings:
Let’s inspect the data before running the various analyses.
1) Rev_Total: The available data for Rev_Total is not normally distributed, as is evident from
the graph & the p-value below 0.05.
2) Bal_Total: The available data for Bal_Total is not normally distributed, as is evident from
the graph & the p-value below 0.05.
3) Offer: 4063 customers received the promotional offer, so a customer is more likely to have
received the offer than not.
4) CHQ: The frequency for debit card account activity shows an almost equal probability of
low (or zero) and greater account activity.
5) CARD: Similarly, for credit card activity there is a roughly equal chance of low (or zero)
and greater account activity.
6) SAV 1: The primary savings account activity record suggests that more than 80% of
customers have low or zero account activity & almost 19% of customers have greater
account activity.
7) LOAN: The loan account activity record suggests that more than 88% of customers have
low or zero account activity & almost 11% of customers have greater account activity.
8) MORT: The data for mortgage suggests that 63% of customers are lower tier/less
important and 37% of customers are higher tier/more important to the bank’s portfolio.
9) INSUR: The insurance account activity record suggests that almost 70% of customers
have low or zero account activity & more than 30% of customers have greater account
activity.
10) PENS: The data for retirements savings suggests that 48% of customers are lower tier/less
important and 52% of customers are higher tier/more important to the bank’s portfolio.
11) Check: The checking account activity record suggests that more than 22% of customers
have low or zero account activity & almost 78% of customers have greater account activity.
12) CD: The data for certificate of deposit account tier suggests that 88% of customers are
lower tier/less important and 11% of customers are higher tier/more important to the
bank’s portfolio.
13) MM: The money market account activity record suggests that more than 70% of
customers have low or zero account activity & almost 30% of customers have greater
account activity.
14) AccountAge: The AccountAge denotes the number of years as a customer of the bank. On
average the AccountAge is 7 years; the minimum is 0 & the maximum is 26.
Note: There are 7420 records in the Bank Revenue dataset. The above table shows the statistics for
some of the important variables.
Taking Logs of Rev_Total & Bal_Total
Since Rev_Total & Bal_Total were not normally distributed, we are creating new columns by taking
the logs of both columns.
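As a minimal sketch of this transformation outside JMP, the Python snippet below takes the natural log of a column. The revenue values are hypothetical, the use of the natural log is an assumption, and the `offset` argument is an illustrative guard against zero values that is not part of the original analysis.

```python
import math

def log_transform(values, offset=0.0):
    """Return the natural log of each value.

    offset is a hypothetical guard for zero values (log(0) is undefined);
    it is not something the original analysis describes.
    """
    return [math.log(v + offset) for v in values]

# Hypothetical right-skewed revenue values, not taken from the bank data set.
rev_total = [1.0, 2.0, 5.0, 10.0, 400.0]
log_rev_total = log_transform(rev_total)

# The largest value is 400x the smallest before the transform,
# but the logged values span only about 6 units.
print([round(v, 3) for v in log_rev_total])
```

The transform compresses the long right tail, which is why a logged column is often closer to a normal shape than the raw one.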
Correlations
Correlation is a measure of the linear association between two variables.
In the figure below, the correlations among some of the most important variables suggest that
Bal_Total has the strongest association with the revenue generated by the customer over a 6-month period.
From the pairwise correlations we can notice the positive correlation between Bal_Total and Rev_Total.
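For illustration only, a pairwise (Pearson) correlation can be computed as in the sketch below. The two short lists are hypothetical stand-ins for Bal_Total and Rev_Total, not values from the data set.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical balances and revenues that rise together.
bal_total = [100.0, 250.0, 400.0, 800.0]
rev_total = [1.0, 2.0, 4.5, 7.0]
print(round(pearson_r(bal_total, rev_total), 3))
```

A value near +1 indicates a strong positive linear association, a value near -1 a strong negative one, and a value near 0 little linear association.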
Treating Missing Data & Outliers
Note: There were no missing values found.
Note: The outlier analysis did not flag any values significant enough to consider removing.
Stage 2: Running Models
In the second stage we will run different combinations of models such as Multiple Linear Regression,
ANOVA, Decision Tree, Principal Components, LASSO, Clustering & Market Basket. Some of the
selected statistical analyses may not be applicable to the given dataset. We will use our best
judgement, do the analysis, and combine the newly created variables to run the Multiple Linear
Regression. We will also save the validation column as well as the prediction formula columns so
that we can perform the model comparison analysis in stage 3 of this paper.
Section 1: Find a good/“best” model using multiple linear regression
Multiple linear regression attempts to model the relationship between two or more
explanatory variables and a response variable by fitting a linear equation to observed
data. Every value of the independent variable $x$ is associated with a value of the
dependent variable $y$. The population regression line for $p$ explanatory variables
$x_1, x_2, \ldots, x_p$ is defined to be
$\mu_y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$. This line describes how the
mean response $\mu_y$ changes with the explanatory variables. The observed values
for $y$ vary about their means $\mu_y$ and are assumed to have the same standard
deviation $\sigma$. The fitted values $b_0, b_1, \ldots, b_p$ estimate the parameters
$\beta_0, \beta_1, \ldots, \beta_p$ of the population regression line.
Since the observed values for $y$ vary about their means $\mu_y$, the multiple regression
model includes a term for this variation. In words, the model is expressed as
DATA = FIT + RESIDUAL,
where the "FIT" term represents the expression
$\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$. The
"RESIDUAL" term represents the deviations of the observed values $y$ from their
means $\mu_y$, which are normally distributed with mean 0 and variance $\sigma^2$. The
notation for the model deviations is $\varepsilon$.
Formally, the model for multiple linear regression, given $n$ observations, is
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i$ for $i = 1, 2, \ldots, n$. (5)
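The fitted values $b_0, b_1, \ldots, b_p$ can be obtained by least squares. The sketch below solves the normal equations with plain Python; it is an illustration under the assumption of a small, well-conditioned problem, not the procedure JMP uses internally, and the data are made up so that the true coefficients are known.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_ols(rows, y):
    """Return [b0, b1, ..., bp] minimizing the sum of squared residuals."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 carries the intercept
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical data generated exactly from y = 1 + 2*x1 - 3*x2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]
coefficients = fit_ols(rows, y)
print([round(b, 6) for b in coefficients])
```

Because the example data contain no noise, the recovered coefficients match the generating equation up to rounding; with real data the residuals would be non-zero.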
Section 1.1: Multiple Linear Regression using all variables
In this section, we will perform the multiple linear regression using the dependent
variable ‘Rev_Total’ along with all the other independent variables. We will also
examine the effect summary report, Summary of fit, parameter estimates and adjust the
model.
1) In Fig 1.1 the Effect Summary report lists the LogWorth or False Discovery Rate (FDR)
LogWorth values in ascending p-value order. These statistics measure the
effects of the independent variables in the model. A LogWorth value greater than 2
corresponds to a p-value of less than .01. The FDR LogWorth is a better statistic for
assessing significance since it adjusts the p-values to account for the false discovery rate
from multiple tests.
In Fig 1.1 we can notice that the LogWorth values for the top 4 variables, Log Bal_Total,
CARD, Check & Offer, are above 2 and their p-values are below .01.
Fig 1.1
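To make the LogWorth scale concrete, the sketch below converts p-values to LogWorth values as -log10(p) and applies a Benjamini-Hochberg adjustment before converting again. Treating the FDR adjustment as Benjamini-Hochberg is an assumption here, and the p-values are hypothetical rather than read from Fig 1.1.

```python
import math

def logworth(p_value):
    """LogWorth = -log10(p-value); a value above 2 means p < 0.01."""
    return -math.log10(p_value)

def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values for three effects.
p_values = [0.01, 0.04, 0.03]
fdr_p_values = bh_adjust(p_values)
for p, q in zip(p_values, fdr_p_values):
    print(round(logworth(p), 3), round(logworth(q), 3))
```

The adjusted p-values are never smaller than the raw ones, so an FDR LogWorth is never larger than the corresponding LogWorth.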
2) Evaluating the statistical significance of the Model:
Fig 1.2
• In the above figure we can notice that, according to the Parameter Estimates table, the
multiple linear regression equation is as follows:
Rev_Total = -2.53 + 0.4421 * Bal_Total - 0.06 * Offer - 0.00057 * Age - 0.004403 * CHQ
- 0.783241 * CARD + 0.01095 * SAV1 + 0.0587 * Loan + 0.01605 * Mort - 0.03717 *
INSUR - 0.000349 * PENS + 0.686 * Check
• In Fig 1.2 we can see that the p-value for the F test is <0.0001. So we reject the null
hypothesis of the F test and can conclude that one or more of the independent variables
(that is, one or more of the variables Bal_total, Offer, Age, Chq, Card, Sav1, Loan, Mort,
Insur, Pens, Check, CD, MM, Savings, and AccountAge) is significantly related to
Rev_Total.
• T test for each independent variable: We can see in Fig 1.2 above that
Bal_Total, Offer, Card, Loan, Insur and Check each reject H0 and are
significantly related to Rev_Total above and beyond the other variables in the model.
• Multicollinearity (VIF): Most of the independent variables show no significant
multicollinearity, as their VIFs fall below 5. Card and Check have high VIFs (>5),
which suggests the presence of multicollinearity.
• Let’s examine the multiple linear regression further to obtain the baseline values.
We find an adjusted R² = 0.5988 and an RMSE = 0.8105.
RMSE is a measure of the average deviation of the estimates from the observed
values, that is, the square root of the variance of the residuals. R² is the fraction
of the total sum of squares that is explained by the regression.
As we need to do forecasting, we should consider RMSE here.
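The three fit statistics used throughout this paper can be written out as in the sketch below. It assumes the RMSE divides the residual sum of squares by n - p - 1 (the residual degrees of freedom for p predictors); the actual and predicted values are hypothetical.

```python
import math

def fit_metrics(actual, predicted, n_predictors):
    """Return R², adjusted R² and RMSE for a fitted regression.

    RMSE here uses the residual degrees of freedom (n - p - 1), which is an
    assumption about how the reported RMSE is defined.
    """
    n = len(actual)
    mean_y = sum(actual) / n
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained
    sst = sum((a - mean_y) ** 2 for a in actual)                # total
    r2 = 1.0 - sse / sst
    dof = n - n_predictors - 1
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / dof
    rmse = math.sqrt(sse / dof)
    return {"r2": r2, "adj_r2": adj_r2, "rmse": rmse}

# Hypothetical observed and predicted values from a one-predictor model.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
metrics = fit_metrics(actual, predicted, n_predictors=1)
print({k: round(v, 4) for k, v in metrics.items()})
```

R² is unit-free, while RMSE is in the units of the response, which is why RMSE is the more direct measure of expected forecast error.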
Section 1.2: Model Improvement Steps
Stepwise Regression
Fig 1.3
In the figure above we can notice that the RMSE has not changed much and the adjusted R² has
decreased. We will attempt to further improve the model by removing the least contributing
variables. The VIF is greater than 5 for Card.
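As a rough illustration of what a VIF above 5 means, the sketch below covers the special case of exactly two predictors, where VIF = 1 / (1 - r²) and r is their pairwise correlation. With more predictors, each VIF comes from regressing one predictor on all the others; that general case is not shown. The data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two predictors."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

# Hypothetical predictors: one pair uncorrelated, one pair nearly collinear.
x1 = [1.0, 2.0, 3.0, 4.0]
uncorrelated = [1.0, -1.0, -1.0, 1.0]
nearly_collinear = [1.1, 1.9, 3.2, 3.9]

print(round(vif_two_predictors(x1, uncorrelated), 3))
print(round(vif_two_predictors(x1, nearly_collinear), 1))
```

A VIF of 1 means the predictor shares no linear information with the other one, while a large VIF means its coefficient estimate is inflated by overlap with the other predictor.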
Excluding Variables
Fig 1.4
Removing the non-significant columns has not significantly changed the R² & RMSE.
Section 2: Find a good/“best” model using Decision Tree
A decision tree is a hierarchical collection of rules that specify how a data set is to be broken up
into smaller groups based on a target variable (dependent y variable). If the target variable is
categorical, then the decision tree is called a classification tree. If the target variable is
continuous, then the decision tree is called a regression tree.
Our dataset has mostly categorical data.
JMP automatically chooses the variable and split that maximize the LogWorth statistic.
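The idea of searching for the best split can be sketched as follows. Note the swapped-in criterion: this example picks the threshold that minimizes the within-group sum of squared errors, a common regression-tree criterion, rather than JMP's LogWorth statistic. The data are hypothetical.

```python
def sum_squared_error(values):
    """Sum of squared deviations from the group mean (0.0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, sse) for the split x < threshold with the lowest total SSE."""
    best = None
    # Every distinct value except the smallest is a candidate threshold.
    for threshold in sorted(set(x))[1:]:
        left = [yi for xi, yi in zip(x, y) if xi < threshold]
        right = [yi for xi, yi in zip(x, y) if xi >= threshold]
        total = sum_squared_error(left) + sum_squared_error(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best

# Hypothetical balances and revenues with an obvious break between 3 and 10.
balance = [1, 2, 3, 10, 11, 12]
revenue = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
split = best_split(balance, revenue)
print(split)
```

A tree is grown by applying the same search recursively inside each resulting group until no split is judged worthwhile.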
Section 2.1: Find a good/“best” model using Decision Tree without the Log Values
Fig 2.1
In the above figure we can notice that the splits were made on Bal_Total. We split the tree on
the best split each time and kept pruning the worst split.
Fig 2.2
In the above figure we can notice that the node for Bal_Total<$2 has been split into a leaf for
Bal_Total<$1. We will perform a couple more splits to observe the changes in LogWorth
values.
Note: The LogWorth statistic is used for pruning or growing a tree. It is defined as
–log10(p-value). Typically, if the LogWorth is greater than 2, then the variable that is used in
the branch is significant and should be included in the tree.
Fig 2.3
In the above figure we can notice that the LogWorth value for Bal_Total<1 is 2.84 and in the next
split the LogWorth is 1.21, and finally after PENS(1) the tree could not be split further. Hence we
will prune all the splits with a LogWorth below 2.
Fig 2.4
In the above figure, we can notice that the Bal_Total variable is the sole contributor.
Fig 2.4
In the above figure, we can notice that the Leaf report shows the mean and counts of the bottom-
level leaves.
Note: We have saved the prediction formula column to our dataset for model comparison in
stage 3.
Section 2.2: Find a good/“best” model using Decision Tree with Log values for Rev_Total & Bal_Total
In this section we can notice that taking logs helps improve the R², though the RMSE also
increased.
Section 3: Find a good/“best” model using ANOVA
Analysis of Variance (ANOVA) is a statistical method used to test differences between two or
more means. As the name suggests inferences about means are made by analyzing
variance. ANOVA is used to test general rather than specific differences among means.
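How variances lead to a statement about means can be seen in the one-way F statistic, which divides the between-group mean square by the within-group mean square. The sketch below computes it for three hypothetical groups; it does not compute the p-value, which requires the F distribution.

```python
def one_way_f(groups):
    """F statistic for a one-way ANOVA over a list of groups (lists of numbers)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: how far group means sit from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical revenue values for three customer groups.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_statistic = one_way_f(groups)
print(round(f_statistic, 3))
```

A large F statistic means the group means differ by more than the within-group spread would explain, which is the evidence used to reject the null hypothesis of equal means.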
Section 3.1: One-way ANOVA
Performing a One-way ANOVA: In the previous sections we found that Bal_Total was the
most contributing factor in Rev_Total. Hence for the one-way ANOVA we changed Bal_Total to
nominal and observed the results below:
Fig 3.1
Fig 3.2
In the above figure we can notice that the R² is 0.92, the adjusted R² is 0.65 and the RMSE is 1.89.
The F test here suggests that the null hypothesis is rejected since the p-value is <0.01.
There is not much more to do since JMP is not showing results for Levene’s, Welch’s, or
Tukey’s HSD tests.
Fig 3.3 Performing a one-way ANOVA on Card & Offer
As in Section 1 we noticed the significant impact of the nominal variables Card & Offer, we are
further analyzing their impact.
Here also we will reject the null hypothesis for the F test.
Section 3.2: Two-way ANOVA
Performing a two-way ANOVA on Card & Offer
We can see here that the LogWorth value of Card is 1.783, which is significantly better than the others.
Here we can notice from the connecting letters report that levels not connected by the same letter
are significantly different.
Section 3.3: Evaluating model for equal variances
We attempted to evaluate the model by adding a new column for the concatenated Card & Offer,
though the one-way ANOVA would not run on it as their data types were not compatible.
Section 4: Running a Principal Component Analysis
Principal Component Analysis (PCA) is an exploratory multivariate technique with two
overall objectives. The first objective is “dimension reduction”, that is, reducing several variables
to a few with a minimum loss of information, and the second objective is to “discover the structure
in the relationships between the variables”.
Fig 4.1
Fig 4.2
The first eigenvalue, 4.0061 is larger than the second eigenvalue, 2.9642. This suggests that the
first principal component, Prin1, is much more important (in terms of explaining the variation in
the pair of variables) than the second principal component, Prin2.
Also, the bar for the percentage of variation accounted for by Prin 1 is about 25.038% & Prin 2 is
about 18.52%. The first five principal components have a cumulative percentage of 72.65%.
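The percentages above follow from the eigenvalues. For a PCA on the correlation matrix the eigenvalues sum to the number of variables, so each component's share is its eigenvalue divided by that count. The sketch below assumes 16 variables entered the PCA, which is inferred from the reported percentages rather than stated in the output.

```python
# Eigenvalues reported for the first two principal components.
eigenvalues = {"Prin1": 4.0061, "Prin2": 2.9642}

# Assumption: 16 variables entered the PCA, so the eigenvalues sum to 16.
n_variables = 16

percent_variation = {
    name: 100.0 * value / n_variables for name, value in eigenvalues.items()
}
for name, pct in percent_variation.items():
    print(name, round(pct, 3))
```

Under that assumption the computed shares agree with the reported figures of about 25.038% and 18.52% to within rounding.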
The scatterplot below is clustered towards different quadrants indicating a prospective
correlation between the principal components.
In the Loading plot the variables are roughly clustered into 4 groups.
In Fig 4.2 the Eigenvectors table suggests that the figures in darker shade are statistically
significant and may have an impact on Revenue. The Scree plot suggests that Prin 1 is highly
significant, and the elbow at Prin 5 suggests that we should consider five principal components
for our analysis.
Fig 4.3
Fig 4.4
Fig 4.5
After inspecting Rotated factor loading, we found that the results are similar to score plot &
eigenvectors as the Insur, MM & the savings are the main driving factors of Principal component
1.
Section 5: LASSO
The Lasso is a shrinkage and selection method for linear regression. It minimizes the usual sum
of squared errors, with a bound on the sum of the absolute values of the coefficients. It has
connections to soft-thresholding of wavelet coefficients, forward stagewise regression, and
boosting methods.
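The link to soft-thresholding explains why the Lasso can set coefficients exactly to zero. In the simplified case of orthonormal predictors, each Lasso coefficient is the least-squares coefficient shrunk toward zero by the penalty and clipped at zero. The sketch below shows only that operator; the coefficient values and the penalty are hypothetical, and this is not the full fitting algorithm.

```python
def soft_threshold(coefficient, penalty):
    """Shrink a coefficient toward zero by penalty; return 0.0 if it crosses zero."""
    if coefficient > penalty:
        return coefficient - penalty
    if coefficient < -penalty:
        return coefficient + penalty
    return 0.0

# Hypothetical least-squares coefficients and a hypothetical penalty.
ols_coefficients = {"A": 0.5, "B": -0.1, "C": -0.5, "D": 0.05}
penalty = 0.2
lasso_coefficients = {
    name: soft_threshold(value, penalty) for name, value in ols_coefficients.items()
}
print(lasso_coefficients)
```

Coefficients smaller in magnitude than the penalty drop out entirely, which is the selection behavior, while the larger ones are shrunk, which is the shrinkage behavior.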
Fig 5.1
Fig 5.2
In the above figure we can notice that CD, MM & Savings have been removed. This is one of
the advantages of the LASSO: the variables often hit the zero axis in groups, enabling the
modeler to drop several variables at a time.
From the effect test above, the most important variables are Loan, Bal_Total, Card & Insur.
Section 6: Running Clustering
Cluster analysis is an exploratory multivariate technique designed to uncover natural groupings
of the rows in a data set. Clustering is helpful when we have more than 2 dimensions in a dataset.
Cluster analysis is a technique where no dependence in any of the variables is required. The
objective of the cluster analysis is to divide the data set into groups, where the observations
within each group are relatively homogeneous, and yet the groups are different than each other.
The initial dendrogram defaults to five clusters and the scree plot appears to have an elbow
somewhere between the fourth and fifth point. We examined the clusters briefly to obtain a
baseline for comparison.
Hierarchical Clustering
Fig 6.1
Fig 6.2
The initial dendrogram defaults to four clusters. We examined the clusters briefly to obtain a
baseline for comparison. Cluster 4 has the most observations, i.e. 2816.
Cluster groupings are described in greater detail in the K Means section below since the
Hierarchical analysis is meant for initial inspection.
K Means
We also conducted K Means for 6 clusters. The standard biplot shows overlaps, which were further
confirmed by a closer look at the 3D biplot.
Fig 6.3
Fig 6.4
The K Means output shows that Cluster 1 could be an outlier with only one observation. The
clusters can be described as follows:
Cluster 1: Only 1 observation with the highest Rev_Total
Cluster 2: This cluster has highest Bal_Total
Cluster 3: It has 813 records
Cluster 4: It has 101 records
Cluster 5: It has the highest number of observations with 2801
Cluster 6: It has 1915 records
The K means method is intended for use with larger data tables, from approximately 200 to
100,000 observations and the result can be highly sensitive to the order of the observations in the
data table.
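The basic K Means iteration can be sketched as below: assign each point to its nearest center, then move each center to the mean of its assigned points. This is a plain illustration with hypothetical 2-D points and hand-picked starting centers; it is not JMP's implementation.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centers, iterations=10):
    """Run a fixed number of assignment and update steps; return (centers, clusters)."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: squared_distance(p, centers[c]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical points forming two well-separated groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
final_centers, final_clusters = kmeans(points, centers=[(0, 0), (10, 10)])
print(final_centers)
print([len(c) for c in final_clusters])
```

Because the outcome depends on the starting centers, different starting choices can produce different clusterings, which is one reason results can be sensitive to the ordering of the data.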
Section 7: Market Basket Analysis
Market Basket Analysis is one of the key techniques used by large retailers to uncover
associations between items. It works by looking for combinations of items that occur together
frequently in transactions. To put it another way, it allows retailers to identify relationships
between the items that people buy.
Association Rules are widely used to analyze retail basket or transaction data, and are intended to
identify strong rules discovered in transaction data using measures of interestingness, based on
the concept of strong rules.
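The usual measures of interestingness for a rule "antecedent implies consequent" are support, confidence and lift. The sketch below computes them for a handful of hypothetical transactions; the item names are illustrative and no such transaction table exists in the bank data set.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return support, confidence and lift for the rule antecedent -> consequent."""
    n = len(transactions)
    count_antecedent = sum(1 for t in transactions if antecedent <= t)
    count_consequent = sum(1 for t in transactions if consequent <= t)
    count_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = count_both / n                      # share of baskets with both item sets
    confidence = count_both / count_antecedent    # P(consequent | antecedent)
    lift = confidence / (count_consequent / n)    # confidence relative to base rate
    return {"support": support, "confidence": confidence, "lift": lift}

# Hypothetical baskets of items held together.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"jam"},
]
metrics = rule_metrics(transactions, antecedent={"bread"}, consequent={"butter"})
print({k: round(v, 3) for k, v in metrics.items()})
```

A lift above 1 suggests the consequent appears with the antecedent more often than its overall frequency would predict.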
Note: Due to lack of Customer ID or any sort of ID it was difficult to run the Market Basket
Analysis.
Section 8: Find a good/“best” model using multiple linear regression with the 5 principal components
In this section, we will run the multiple linear regression with the 5 principal components that we
got from Section 4. Here we will examine the predicted plots, LogWorth, p-value, and R² value.
The term leverage is used because these plots help you visualize the influence of points on the
test for including the effect in the model. A point that is horizontally distant from the center of
the plot exerts more influence on the effect test than does a point that is close to the center.
• First we ran the Multiple Linear Regression with 5 components.
We obtained R² = 0.661385 and RMSE = 1.866957.
• Then we ran the stepwise Multiple Linear Regression with 5 components and noticed that
it automatically removed Prin 2.
We obtained R² = 0.661325 and RMSE = 1.86696.
• Next, to further improve the model, we removed Prin1 & ran the stepwise Multiple
Linear Regression with 4 components. As before, Prin2 was automatically removed.
We obtained R² = 0.65757 and RMSE = 1.87719.
We can see that all models thus far produce approximately the same results.
Section 9: Find a good/“best” model using multiple linear regression with the clusters
After running the Multiple Linear Regression with clusters we got an R² of 0.223874 & an RMSE of
1.126958. Most of the clusters are statistically significant at the .05 significance level and almost
all the VIFs are greater than 5.
Section 9.1: Model Improvement – Stepwise
We will now rerun the model as a stepwise regression to further improve the model.
In the above figure we can notice that the VIFs have now decreased and all the VIFs are
around 5, though there is not much change in the R² and RMSE values.
Section 10: Find a good/“best” model using multiple linear regression with the clusters and the original data
In this model we ran the multiple linear regression with the clusters and the original data. We can
notice in the effect summary within Fig 9.1 that many of the variables, like CARD, AGE etc., have
LogWorth values below 2.
The R² is 0.686 and the RMSE is 0.7172.
Within the parameter estimates we can notice that the VIFs for many variables are greater than 5,
and a few of the variables are statistically non-significant.
Fig 9.1
Section 10.1: Model Improvement – Stepwise
Fig 9.2
In the above figure we can notice that the statistically non-significant variables are removed
automatically, though there is not much change in the R² and the RMSE. The LogWorth is good for all
the selected variables. Many of the VIFs are still very high.
Section 10.2: Model Improvement – Removing Variables
We will remove INSUR and CD as they have high VIFs. The R² has decreased and the RMSE has now increased.
Fig 9.3
Section 11: Find a good/“best” model using multiple linear regression with the clusters and the 5 components
Fig 10.1
We ran the multiple linear regression with the clusters and the 5 components. We noticed that Prin1
has a LogWorth value below 2. The VIFs are high for the clusters & Prin3. We will further attempt to
improve the model.
Section 11.1: Model Improvement – Stepwise
Fig 10.2
Here we can notice that the R² value has decreased and the RMSE has increased, though the LogWorth
value for each variable is above 2. The VIFs for the clusters, Prin1 and Prin3 are greater than 5.
Section 11.2: Model Improvement – Removing Variables
Fig 10.3
Here we can notice that the R² value has decreased and the RMSE has increased, though the
LogWorth value for each variable is above 2. The VIFs for the clusters are greater than 5.
Stage 3: Model Comparison
Section 1: Creating a validation column
Fig 1
Section 2: Saving Prediction Formulas for each model
While running 10 different combinations of models in stage 2, we saved the prediction formulas to
compare our models.
Fig 2
Section 3: Performing Model Comparison
Fig 3
Fig 4
The above figure gives a snapshot of the model comparison by validation set and the prediction formulas.
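The comparison logic can be sketched as follows: label each row as Training or Validation, then score every saved prediction column on the Validation rows only. The 30% validation fraction, the seed, and the function names are assumptions for illustration; the split actually used in JMP is not stated here.

```python
import math
import random

def make_validation_column(n_rows, validation_fraction=0.3, seed=1):
    """Return a shuffled list of 'Training' / 'Validation' labels.

    validation_fraction and seed are hypothetical choices for this sketch.
    """
    n_validation = round(n_rows * validation_fraction)
    labels = ["Validation"] * n_validation + ["Training"] * (n_rows - n_validation)
    random.Random(seed).shuffle(labels)
    return labels

def validation_rmse(actual, predicted, labels):
    """Root mean squared error computed on the Validation rows only."""
    squared_errors = [
        (a - p) ** 2
        for a, p, label in zip(actual, predicted, labels)
        if label == "Validation"
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical actual values and two hypothetical prediction columns.
actual = [1.0, 2.0, 3.0]
labels = ["Training", "Validation", "Validation"]
model_a = [1.0, 2.0, 5.0]
model_b = [1.5, 2.0, 3.0]
scores = {
    "model_a": validation_rmse(actual, model_a, labels),
    "model_b": validation_rmse(actual, model_b, labels),
}
print(min(scores, key=scores.get))
```

Scoring on held-out rows favors the model that generalizes best rather than the one that fits the training rows most closely.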
Model Comparison Summary
In this section we summarize our statistical analysis. As is evident from the above summary table,
Section 10 is the best model with R² = 0.686157, Adj. R² = 0.685352 and RMSE = 0.717263. An
attempt to further improve the model by doing stepwise selection and removing non-significant
variables was not helpful in improving the R² or RMSE, though there was some improvement in the
LogWorth & VIFs.
References
1. Descriptive Statistics: https://socialresearchmethods.net/kb/statdesc.php
2. Hypothesis Testing: http://mathworld.wolfram.com/HypothesisTesting.html
3. Test concerning the mean of Normal Population: http://demonstrations.wolfram.com/ThePowerOfATestConcerningTheMeanOfANormalPopulation/
4. Assessing Normality: https://www.jmp.com/en_hk/learning-library/probabilities-and-distributions.html
5. Analysis Toolpak: https://support.office.com/en-us/article/use-the-analysis-toolpak-to-perform-complex-data-analysis-6c67ccf0-f4a9-487c-8dec-bdb5a2cefab6?NS=EXCEL&Version=90&SysLcid=1033&UiLcid=1033&AppVer=ZXL900&HelpId=xladdin.chm1786&ui=en-US&rs=en-US&ad=US
6. Correlation: https://www.jmp.com/content/dam/jmp/documents/en/academic/learning-library/05-correlation.pdf
7. RMSE & R²: https://www.quora.com/What-is-the-difference-between-RMSE-and-R-squared-in-statistics
8. ANOVA: http://onlinestatbook.com/2/analysis_of_variance/intro.html
9. LASSO: http://statweb.stanford.edu/~tibs/lasso.html
10. Market Basket Analysis: https://towardsdatascience.com/a-gentle-introduction-on-market-basket-analysis-association-rules-fa4b986a40ce
11. Market Basket Analysis: http://databoosting.com/boosting-revenue-market-basket-analysis/

Akanksha final case_a_bank revenues

  • 1. Final Case A: Bank Revenues Quest to best model that predicts profitability for a given customer Akanksha Sinha 10/20/19 Introduction to Data Mining
  • 2. Final Case A:Bank Revenues-PredictiveStatistical Analysis 1 AkankshaSinha© 2019 Profitability Prediction Statistical Analysis Assignment Name: Final Case A| Dataset: Bank Revenues DSS 660 | Fall 2019 | Under supervision of Prof. Dr. Ronald Klimberg Submitted by Akanksha Sinha This Photo byUnknown Author is licensedunder CCBY-NC
  • 3. Final Case A:Bank Revenues-PredictiveStatistical Analysis 2 AkankshaSinha© 2019 Contents Overview...........................................................................................................................................4 Objective:......................................................................................................................................4 Data:.............................................................................................................................................4 Stage 1: Descriptive Statistics ...........................................................................................................5 Initial Findings:............................................................................................................................5 Taking Logs of Rev_Total & Bal_Total.......................................................................................10 ....................................................................................................................................................10 Correlations ................................................................................................................................11 Treating Missing Data & Outliers...............................................................................................11 Stage 2: Running Models ................................................................................................................13 Section 1: Find a good/“best” model using multiple linear regression.........................................13 Section 1.1:Multiple Linear Regression using all variables.........................................................14 Section 1.2:Model Improvement Steps.......................................................................................16 Section 2: Find a good/“best” model using Decision Tree............................................................17 Section 2.1: 
Find a good/“best” model using Decision Tree without the Log Values....................18 Section 2.2: Find a good/“best” model using Decision Tree with Log values for Rev_Total & Bal_Total....................................................................................................................................21 Section 3: Find a good/“best” model using ANOVA....................................................................22 Section 3.1: One-way ANOVA ....................................................................................................22 Section 3.2: Two-way ANOVA....................................................................................................24 Section 3.3: Evaluating model for equal variances ......................................................................26 Section 4: Running a Principal Component Analysis ..................................................................27 Section 5: LASSO........................................................................................................................29 Section 6: Running Clustering ....................................................................................................30 Section 7:Market Basket Analysis ..............................................................................................33 Section 8: Find a good/“best” model using multiple linear regression with the 5 principal components .................................................................................................................................33 Section 8: Find a good/“best” model using multiple linear regression with the clusters...............34 Section 8.1:Model Improvement – Stepwise ...............................................................................36 Section 9: Find a good/“best” model using multiple linear regression with the clusters and the original 
data................................................................................................................................36
  • 4. Final Case A:Bank Revenues-PredictiveStatistical Analysis 3 AkankshaSinha© 2019 Section 9.1:Model Improvement – Stepwise ...............................................................................38 Section 9.2:Model Improvement – Removing Variables.............................................................39 Section 10: Find a good/“best” model using multiple linear regression with the clusters and the 5 components .................................................................................................................................40 Section 10.1: Model Improvement- Stepwise...............................................................................41 Section 10.2: Model Improvement- Removing Variables.............................................................42 Stage 3: Model Comparison............................................................................................................43 Section 1: Creating a validation column......................................................................................43 Section 2: Saving Prediction Formulas for each model................................................................43 Section 3: Performing Model Comparison..................................................................................44 Model Comparison Summary .....................................................................................................45 References.......................................................................................................................................46
  • 5. Final Case A:Bank Revenues-PredictiveStatistical Analysis 4 AkankshaSinha© 2019 Overview Bank management wants to understand how their customer banking habits contribute to revenues and profitability. The data set BankRevnue.jmp has banking information of 7420 customers. The variables in the data set are listed below. Management would like to have model to forecast bank revenues to guide them in future marketing campaigns. A surrogate for customer’s profitability that is available in the data set is the variable Rev_Total which is the total revenue a customer generates through their accounts and transactions. Objective: In this paper, we will perform various statistical analysis to determine the best model that predicts profitability for a given customer. A 360° approach is followed to perform, compare & choose the best statistical analysis. We are performing our analysis in 3 major stages: In stage 1: descriptive statistics, noticing correlations, exploring missing values & outliers; In stage 2: Regression to have a baseline, ANOVA & then bring other tools like PCA, Clustering & Decision Tree. Within each model we will try to further enhance our model by performing stepwise, or/and removing least contributing variables or/and creating new variables. And in stage 3 we will compare our models and choose the best to recommend our predictions. Data: There are 16 variablesinthe dataset.
  • 6. Final Case A:Bank Revenues-PredictiveStatistical Analysis 5 AkankshaSinha© 2019 Stage 1: Descriptive Statistics Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample and the measures. Together with simple graphics analysis, they form the basis of virtually every quantitative analysis of data. (1) Initial Findings: Let’s inspect the data before running the various analyses. 1) Rev_Total: On average the available data for Rev_Total is not normally distributed as evident by graph & the P value less than 0.05. 2) Bal_Total: On average the available data for Bal_Total is not normally distributed as evident by graph & the P value less than 0.05.
  • 7. Final Case A:Bank Revenues-PredictiveStatistical Analysis 6 AkankshaSinha© 2019 3) Offer: There are 4063 members who received the promotional offer & have higher probability w.r.t who doesn’t. 4) CHQ: The frequency for debit card account activity denotes that there is almost equal probability of having low (or zero) vs greater account activity. 5) CARD: Again, for credit card activity too there is an equal chance of having low or zero & greater account activities.
  • 8. Final Case A:Bank Revenues-PredictiveStatistical Analysis 7 AkankshaSinha© 2019 6) SAV 1: The primary savings account activity record suggests that more than 80% of customers have low or zero account activity & almost 19% members have greater account activity. 7) LOAN: The loan account activity record suggests that more than 88% of customers have low or zero account activity & almost 11% members have greater account activity. 8) MORT: The data for mortgage suggests that 63% of customers are lower tier/less important and 37% of customers are higher tier/more important to the bank’s portfolio.
  • 9. Final Case A:Bank Revenues-PredictiveStatistical Analysis 8 AkankshaSinha© 2019 9) INSUR: The insurance account activity record suggests that almost 70% of customers have low or zero account activity & more than 30% members have greater account activity. 10) PENS: The data for retirements savings suggests that 48% of customers are lower tier/less important and 52% of customers are higher tier/more important to the bank’s portfolio. 11) Check: The checking account activity record suggests that more than 22% of customers have low or zero account activity & almost 78% members have greater account activity.
  • 10. Final Case A:Bank Revenues-PredictiveStatistical Analysis 9 AkankshaSinha© 2019 12) CD: The data for certificate of deposit account tier suggests that 88% of customers are lower tier/less important and 11% of customers are higher tier/more important to the bank’s portfolio. 13) MM: The money market account activity record suggests that more than 70% of customers have low or zero account activity & almost 30% members have greater account activity. 14) AccountAge:The AccountAge denotesthe numberof yearsasa customerof the bank. On average the AccountAge is 7 yearsand the minimumis0& maximumis26.
  • 11. Final Case A:Bank Revenues-PredictiveStatistical Analysis 10 AkankshaSinha© 2019 Note:There are 7420 records inthe Bank Revenue Dataset.The above table denotesthe statsforsome of the importantvariables. Taking Logs of Rev_Total& Bal_Total Since the Rev_Total & Bal_Total were not normallydistributedwe are creatingnew columnsbytaking logsof bothcolumns.
  • 12. Final Case A:Bank Revenues-PredictiveStatistical Analysis 11 AkankshaSinha© 2019 Correlations Correlation is a measure of the linear association between two variables. On the below figure some of the most important variable correlation suggests that Bal_Total has more significance on revenue generated by the customer over a 6-month period. From pairwise correlations we can notice the positive correlation of Bal_Total on Rev_Total. Treating Missing Data & Outliers. Note:There were nomissingvaluesfound. Note:The Outliersanalysisdoesn’tresultanysignificantvalueto considerremoving.
  • 13. Final Case A:Bank Revenues-PredictiveStatistical Analysis 12 AkankshaSinha© 2019
  • 14. Final Case A:Bank Revenues-PredictiveStatistical Analysis 13 AkankshaSinha© 2019 Stage 2: Running Models In the second stage we will run different combinations of model like Multiple Linear Regression, ANOVA, Decision Tree, Principal Componnets, LASSO, Clustering & Market Basket. It may be possible that some of the selected statistical analysis is not applicable to the given dataset. We will make our best judgement and do the analysis and combine the newly created variables to run the Multiple Linear Regression. We will also save the validation columns as well as the Prediction formula column so that we could perform the model comparison analysis in the stage 3 of this paper. Section1: Find a good/“best”model using multiple linear regression Multiple linear regression attempts to model the relationship between two or more explanatory variables and a response variable by fitting a linear equation to observed data. Every value of the independent variable x is associated with a value of the dependent variable y. The population regression line for p explanatory variables x1, x2, ... , xp is defined to be y = 0 + 1x1 + 2x2 + ... + pxp. This line describes how the mean response y changes with the explanatory variables. The observed values for y vary about their means y and are assumed to have the same standard deviation . The fitted values b0, b1, ..., bp estimate the parameters 0, 1, ..., p of the population regression line. Since the observed values for y vary about their means y, the multiple regression model includes a term for this variation. In words, the model is expressed as DATA = FIT + RESIDUAL, where the "FIT" term represents the expression 0 + 1x1 + 2x2 + ... pxp. The "RESIDUAL" term represents the deviations of the observed values y from their means y, which are normally distributed with mean 0 and variance . The notation for the model deviations is . 
Formally, the model for multiple linear regression, given n observations, is yi = 0 + 1xi1 + 2xi2 + ... pxip + i for i = 1,2, ... n. (5)
  • 15. Final Case A:Bank Revenues-PredictiveStatistical Analysis 14 AkankshaSinha© 2019 Section 1.1: Multiple Linear Regressionusing all variables In this section, we will perform the multiple linear regression using the dependent variable ‘Rev_Total’ along with all the other independent variables. We will also examine the effect summary report, Summary of fit, parameter estimates and adjust the model. 1) In the Fig 1.1 the Effect Summary report lists in ascending p-value order the LogWorth or False Discovery Rate (FDR) LogWorth values. These statistical values measure the effects of the independent variables in the model. A LogWorth value greater than 2 corresponds to a p-value of less than .01, The FDR LogWorth is better statistic for assessing significance since it adjusts the p-values to account for the false discovery rate from multiple tests. In the Fig 1.1 We can notice that the LogWorth value for top 4 variables Log Bal_Total, CARD, Check & Offer are above 2 and the p-value is below .01. Fig 1.1
  • 16. 2) Evaluating the statistical significance of the model:
Fig 1.2
- In the above figure we can notice that, according to the Parameter Estimates table, the multiple linear regression equation is as follows:
Rev_Total = -2.53 + 0.4421 * Bal_Total - 0.06 * Offer - 0.00057 * Age - 0.004403 * CHQ - 0.783241 * CARD + 0.01095 * SAV1 + 0.0587 * Loan + 0.01605 * Mort - 0.03717 * INSUR - 0.000349 * PENS + 0.686 * Check
- In Fig 1.2 we can see that the p-value for the F test is <0.0001. So we reject the null hypothesis of the F test and can conclude that one or more of the independent variables (that is, one or more of Bal_Total, Offer, Age, Chq, Card, Sav1, Loan, Mort, Insur, Pens, Check, CD, MM, Savings, and AccountAge) is significantly related to Rev_Total.
- T test for each independent variable: We can see in Fig 1.2 above that Bal_Total, Offer, Card, Loan, Insur and Check each reject H0 and are significantly related to Rev_Total above and beyond the other variables in the model.
  • 17. - Multicollinearity (VIF): Most of the independent variables show no significant multicollinearity, as their VIFs fall below 5. Card and Check have VIFs above 5, which suggests the presence of multicollinearity.
- Let’s examine the multiple linear regression further to obtain the baseline values. We find an adjusted R² = 0.5988 and an RMSE = 0.8105. RMSE is a measure of the average deviation of the estimates from the observed values, that is, the square root of the variance of the residuals. R², by contrast, is the fraction of the total sum of squares that is explained by the regression. Because our goal is forecasting, RMSE is the measure we should focus on here.
Section 1.2: Model Improvement Steps
Stepwise Regression
Fig 1.3
In the figure above we can notice that the RMSE has not changed much and the adjusted R² has decreased. We will attempt to further improve the model by removing the least contributing variables. The VIF is greater than 5 for Card.
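The fit measures quoted in this section (R², adjusted R², RMSE and VIF) can be computed from actual and predicted values as sketched below. The formulas are the standard textbook ones. Dividing the residual sum of squares by the error degrees of freedom (n - p - 1) is assumed to match what the software reports, and the numbers are made up for illustration.

```python
import math


def fit_statistics(actual, predicted, n_predictors):
    """Return R², adjusted R² and RMSE for a fitted regression."""
    n = len(actual)
    mean_y = sum(actual) / n
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained
    sst = sum((a - mean_y) ** 2 for a in actual)                # total
    df_error = n - n_predictors - 1
    r2 = 1.0 - sse / sst
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / df_error
    rmse = math.sqrt(sse / df_error)
    return {"r2": r2, "adj_r2": adj_r2, "rmse": rmse}


def vif(r2_on_other_predictors):
    """VIF for one predictor, given the R² from regressing it on the others."""
    return 1.0 / (1.0 - r2_on_other_predictors)


# Hypothetical observed and fitted values from a one-predictor model.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
stats = fit_statistics(actual, predicted, n_predictors=1)
print(stats, vif(0.8))
```

A VIF of 5 corresponds to a predictor whose variation is 80% explained by the other predictors, which is where the rule-of-thumb cutoff used in this paper comes from.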
  • 18. Excluding Variables
Fig 1.4
Removing the non-significant columns has not significantly changed the R² & RMSE.
Section 2: Find a good/“best” model using Decision Tree
  • 19. A decision tree is a hierarchical collection of rules that specify how a data set is to be broken up into smaller groups based on a target variable (dependent y variable). If the target variable is categorical, then the decision tree is called a classification tree. If the target variable is continuous, then the decision tree is called a regression tree. Our dataset has mostly categorical data. JMP automatically chooses the variable and split that maximize the LogWorth statistic.
Section 2.1: Find a good/“best” model using Decision Tree without the Log values
Fig 2.1
In the above figure we can notice that the splits were made on Bal_Total. We repeatedly split the tree on the best split and pruned the worst split.
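To give a feel for how a regression tree picks a split on a continuous variable such as Bal_Total, the sketch below searches for the threshold that most reduces the within-node sum of squared errors. This criterion is a common stand-in and is not the LogWorth criterion that JMP uses. The function names and data are hypothetical.

```python
def sse(values):
    """Sum of squared deviations of the values from their mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, sse_after_split) for the best rule 'x < threshold'."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [yv for _, yv in pairs[:i]]
        right = [yv for _, yv in pairs[i:]]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical balances and revenues with two obvious groups.
balance = [1, 2, 3, 10, 11, 12]
revenue = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
threshold, remaining_sse = best_split(balance, revenue)
print(threshold, remaining_sse)
```

Growing a tree repeats this search inside each resulting node; pruning removes the splits that contribute least.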
  • 20. Fig 2.2
In the above figure we can notice that the node for Bal_Total<$2 has been split into a leaf for Bal_Total<$1. We will perform a couple more splits to observe the changes in LogWorth values.
Note: The LogWorth statistic is used for pruning or growing a tree. It is defined as -log10(p-value). Typically, if the LogWorth is greater than 2, then the variable that is used in the branch is significant and should be included in the tree.
  • 21. Fig 2.3
In the above figure we can notice that the LogWorth value for Bal_Total<1 is 2.84 and in the next split the LogWorth is 1.21, and finally after PENS(1) the tree couldn’t be split further. Hence we will prune all the splits with a LogWorth below 2.
Fig 2.4
  • 22. In the above figure, we can notice that the Bal_Total variable is the sole contributor.
Fig 2.4
In the above figure, we can notice that the Leaf report shows the mean and counts of the bottom-level leaves.
Note: We have saved the prediction formula column to our dataset for model comparison in Stage 3.
Section 2.2: Find a good/“best” model using Decision Tree with Log values for Rev_Total & Bal_Total
  • 23. In this section we can notice that taking logs improves the R², though the RMSE also increased.
Section 3: Find a good/“best” model using ANOVA
Analysis of Variance (ANOVA) is a statistical method used to test differences between two or more means. As the name suggests, inferences about means are made by analyzing variance. ANOVA is used to test general rather than specific differences among means.
Section 3.1: One-way ANOVA
Performing a one-way ANOVA: In the previous sections we found that Bal_Total was the most contributing factor to Rev_Total. Hence for the one-way ANOVA we changed Bal_Total to nominal and observed the results below:
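The F test behind a one-way ANOVA compares the variation between group means with the variation inside the groups. A minimal sketch with made-up groups follows; the group values are not from the bank dataset, and the function name is hypothetical.

```python
def one_way_anova(groups):
    """Return the F statistic and R² for a one-way ANOVA on lists of values."""
    all_values = [v for g in groups for v in g]
    n = len(all_values)
    k = len(groups)
    grand_mean = sum(all_values) / n
    group_means = [sum(g) / len(g) for g in groups]
    # Between-group variation: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group variation: how far each value is from its own group mean.
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)
    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    r2 = ss_between / (ss_between + ss_within)
    return {"f": f_stat, "r2": r2}


# Three hypothetical groups of revenue values.
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(result)
```

A large F relative to the F distribution with (k - 1, n - k) degrees of freedom gives a small p-value, which is the basis for rejecting the null hypothesis of equal means.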
  • 24. Fig 3.1
Fig 3.2
In the above figure we can notice that the R² is 0.92, the adjusted R² is 0.65 and the RMSE is 1.89. The F test here suggests that the null hypothesis is rejected since the p-value is <0.01. There is not much more to do, since JMP is not showing results for Levene’s, Welch’s, or Tukey’s HSD tests.
  • 25. Fig 3.3
Performing a one-way ANOVA on Card & Offer
As in Section 1 we noticed the significant impact of the nominal variables Card & Offer, we are further analyzing their impact. Here also we will reject the null hypothesis for the F test.
Section 3.2: Two-way ANOVA
Performing a two-way ANOVA on Card & Offer
  • 26. We can see here that the LogWorth value of Card is 1.783, which is significantly better than the others.
  • 27. Here we can notice from the connecting letters report that levels not connected by the same letter are significantly different.
Section 3.3: Evaluating model for equal variances
We attempted to evaluate the model by adding a new column for the concatenated Card & Offer values, though the one-way ANOVA would not run on it as the data type was not compatible.
  • 28. Section 4: Running a Principal Component Analysis
Principal Component Analysis (PCA) is an exploratory multivariate technique with two overall objectives. The first objective is “dimension reduction”, that is, reducing several variables to a few with a minimum loss of information; the second objective is to “discover the structure in the relationships between the variables”.
Fig 4.1
Fig 4.2
The first eigenvalue, 4.0061, is larger than the second eigenvalue, 2.9642. This suggests that the first principal component, Prin1, is more important (in terms of explaining the variation in the variables) than the second principal component, Prin2. Also, the bars for the percentage of variation show that Prin1 accounts for about 25.038% and Prin2 for about 18.52%. The first five principal components have a cumulative percentage of 72.65%. The points in the scatterplot below cluster toward different quadrants, indicating a possible relationship between the principal components.
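The percentages quoted for Prin1 and Prin2 follow from the eigenvalues once the number of variables is known. When PCA is run on correlations, the eigenvalues sum to the number of variables, so each share is eigenvalue / number of variables. The count of 16 variables used below is an assumption inferred from 4.0061 / 0.25038; it is not stated in the report.

```python
def percent_variance(eigenvalue, n_variables):
    """Share of total variance explained by one component, in percent."""
    return 100.0 * eigenvalue / n_variables


N_VARIABLES = 16  # assumption: inferred from the reported eigenvalue and percentage

prin1 = percent_variance(4.0061, N_VARIABLES)
prin2 = percent_variance(2.9642, N_VARIABLES)

# What the third to fifth components must contribute together, given the
# reported cumulative 72.65% for the first five components.
remaining_three = 72.65 - prin1 - prin2

print(round(prin1, 3), round(prin2, 3), round(remaining_three, 3))
```

Under this assumption the computed shares agree with the figures read from the report, which suggests the PCA was run on standardized variables.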
  • 29. In the Loading plot the variables are roughly clustered into 4 groups. In Fig 4.2 the Eigenvectors table suggests that the figures in the darker shade are statistically significant and may have an impact on Revenue. The scree plot suggests that Prin1 is highly significant, and the elbow at Prin5 suggests that we should consider five principal components for our analysis.
Fig 4.3
Fig 4.4
  • 30. Fig 4.5
After inspecting the rotated factor loadings, we found that the results are similar to the score plot & eigenvectors, as Insur, MM & Savings are the main driving factors of principal component 1.
Section 5: LASSO
The Lasso is a shrinkage and selection method for linear regression. It minimizes the usual sum of squared errors, with a bound on the sum of the absolute values of the coefficients. It has connections to soft-thresholding of wavelet coefficients, forward stagewise regression, and boosting methods.
Fig 5.1
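The soft-thresholding connection mentioned above explains why the Lasso sets some coefficients exactly to zero. In the special case of an orthonormal design, with the objective written as half the sum of squared errors plus lambda times the sum of absolute coefficients, each Lasso coefficient is the soft-thresholded least-squares coefficient. The sketch below shows only that special case; the coefficient names and values are made up, and this is not the fitting routine JMP uses.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam, returning exactly 0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical least-squares coefficients under an orthonormal design.
ols_coefs = {"x1": 2.5, "x2": -0.3, "x3": 0.8, "x4": -1.7}

for lam in (0.0, 0.5, 1.0):
    shrunk = {name: soft_threshold(b, lam) for name, b in ols_coefs.items()}
    dropped = [name for name, b in shrunk.items() if b == 0.0]
    print(lam, shrunk, dropped)
```

As the penalty grows, more coefficients reach zero and their variables drop out of the model, which is the selection behaviour described for the Lasso.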
  • 31. Fig 5.2
In the above figure we can notice that CD, MM & Savings have been removed. This is one of the advantages of the LASSO: the variables often hit the zero axis in groups, enabling the modeler to drop several variables at a time. From the effect test above, the most important variables are Loan, Bal_Total, Card & Insur.
Section 6: Running Clustering
Cluster analysis is an exploratory multivariate technique designed to uncover natural groupings of the rows in a data set. Clustering is helpful when we have more than 2 dimensions in a dataset. Cluster analysis is a technique where no dependence in any of the variables is required. The objective of cluster analysis is to divide the data set into groups, where the observations within each group are relatively homogeneous, and yet the groups are different from each other. The initial dendrogram defaults to five clusters and the scree plot appears to have an elbow somewhere between the fourth and fifth point. We examined the clusters briefly to obtain a baseline for comparison.
  • 32. Hierarchical Clustering
Fig 6.1
Fig 6.2
The initial dendrogram defaults to four clusters. We examined the clusters briefly to obtain a baseline for comparison. Cluster 4 has the most observations, i.e. 2816. Cluster groupings are described in greater detail in the K Means section below since the Hierarchical analysis is meant for initial inspection.
  • 33. K Means
We also conducted K Means for 6 clusters. The standard biplot has overlaps, which were further confirmed by a closer look at the 3D biplot.
Fig 6.3
Fig 6.4
The K Means output shows that Cluster 1 could be an outlier with only one observation. The clusters can be described as follows:
Cluster 1: Only 1 observation with the highest Rev_Total
Cluster 2: This cluster has the highest Bal_Total
Cluster 3: It has 813 records
  • 34. Cluster 4: It has 101 records
Cluster 5: It has the highest number of observations with 2801
Cluster 6: It has 1915 records
The K Means method is intended for use with larger data tables, from approximately 200 to 100,000 observations, and the result can be highly sensitive to the order of the observations in the data table.
Section 7: Market Basket Analysis
Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items. It works by looking for combinations of items that occur together frequently in transactions. To put it another way, it allows retailers to identify relationships between the items that people buy. Association rules are widely used to analyze retail basket or transaction data, and are intended to identify strong rules discovered in transaction data using measures of interestingness.
Note: Due to the lack of a Customer ID or any sort of ID it was difficult to run the Market Basket Analysis.
Section 8: Find a good/“best” model using multiple linear regression with the 5 principal components
In this section, we will run the multiple linear regression with the 5 principal components that we got from Section 4. Here we will look at the leverage plots, LogWorth, p-value, and R² value. The term leverage is used because these plots help you visualize the influence of points on the test for including the effect in the model. A point that is horizontally distant from the center of the plot exerts more influence on the effect test than does a point that is close to the center.
- First we ran the Multiple Linear Regression with 5 components. Received R² = 0.661385 and an RMSE = 1.866957
- Then we ran the stepwise Multiple Linear Regression with 5 components, and noticed it automatically removed Prin2. Received R² = 0.661325 and an RMSE = 1.86696
- Next, to further improve the model, we removed Prin1 & ran the stepwise Multiple Linear Regression with 4 components. As earlier, Prin2 was automatically removed. Received R² = 0.65757 and an RMSE = 1.87719
We can see that all models thus far produce approximately the same results.
  • 35. Section 9: Find a good/“best” model using multiple linear regression with the clusters
After running the Multiple Linear Regression with clusters we got an R² of 0.223874 & an RMSE of 1.126958. Most of the clusters are statistically significant at the .05 significance level and almost all the VIFs are greater than 5.
  • 37. Section 9.1: Model Improvement – Stepwise
We will now rerun the model as a stepwise regression to further improve it. Here in the above figure we can notice that the VIFs have now decreased and are all close to 5, though there is not much change in the R² and RMSE values.
Section 10: Find a good/“best” model using multiple linear regression with the clusters and the original data
In this model we ran the multiple linear regression with the clusters and the original data. We can notice in the effect summary within Fig 9.1 that many of the variables, like CARD, AGE etc., have LogWorth values below 2. The R² is 0.686 and the RMSE is 0.7172. Within the parameter estimates we can notice that the VIFs for many variables are greater than 5, and a few of the variables are statistically non-significant.
  • 38. Fig 9.1
  • 39. Section 10.1: Model Improvement – Stepwise
Fig 9.2
In the above figure we can notice that the statistically non-significant variables are removed automatically, though there is not much change in the R² and the RMSE. The LogWorth is good for all the selected variables. Many of the VIFs are still very high.
  • 40. Section 10.2: Model Improvement – Removing Variables
We will remove INSUR and CD as they have high VIFs. The R² has decreased and the RMSE has increased.
Fig 9.3
  • 41. Section 11: Find a good/“best” model using multiple linear regression with the clusters and the 5 components
Fig 10.1
  • 42. We ran the multiple linear regression with the clusters and the 5 components. We noticed that Prin1 has a LogWorth value below 2. The VIFs are high for the clusters & Prin3. We will further attempt to improve the model.
Section 11.1: Model Improvement - Stepwise
Fig 10.2
Here we can notice that the R² value has decreased and the RMSE has increased, though the LogWorth value for each variable is above 2. The VIFs for the clusters, Prin1 and Prin3 are greater than 5.
  • 43. Section 11.2: Model Improvement - Removing Variables
Fig 10.3
Here we can notice that the R² value has decreased and the RMSE has increased, though the LogWorth value for each variable is above 2. The VIFs for the clusters are greater than 5.
  • 44. Stage 3: Model Comparison
Section 1: Creating a validation column
Fig 1
Section 2: Saving Prediction Formulas for each model
While running 10 different combinations of models in Stage 2, we saved the prediction formula of each model so that we could compare them.
Fig 2
  • 45. Section 3: Performing Model Comparison
Fig 3
Fig 4
The above figure gives a snapshot of the model comparison by validation set and the prediction formulas.
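The idea behind the validation column and the model comparison can be sketched as follows: rows are randomly labelled Training or Validation, and each saved prediction formula is scored on the Validation rows only. The 30% validation share, the seed, the model names and the predictions are all hypothetical; the split actually used in the report is not stated. The RMSE here is the plain root mean squared error on the validation rows, without a degrees-of-freedom correction.

```python
import math
import random


def make_validation_column(n_rows, validation_fraction=0.3, seed=660):
    """Label each row 'Training' or 'Validation' at random, reproducibly."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_validation = round(n_rows * validation_fraction)
    validation_rows = set(indices[:n_validation])
    return ["Validation" if i in validation_rows else "Training"
            for i in range(n_rows)]


def validation_rmse(actual, predicted, labels):
    """Root mean squared error computed on the Validation rows only."""
    errors = [(a - p) ** 2
              for a, p, label in zip(actual, predicted, labels)
              if label == "Validation"]
    return math.sqrt(sum(errors) / len(errors))


def rank_models(actual, predictions_by_model, labels):
    """Return (model name, validation RMSE) pairs, best model first."""
    scores = {name: validation_rmse(actual, preds, labels)
              for name, preds in predictions_by_model.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical actual values and two saved prediction columns.
actual = [float(i) for i in range(20)]
predictions = {
    "model_a": [a + 0.1 for a in actual],  # small constant error
    "model_b": [a + 1.0 for a in actual],  # larger constant error
}
labels = make_validation_column(len(actual))
ranking = rank_models(actual, predictions, labels)
print(ranking)
```

Scoring on held-out rows guards against choosing a model that only looks good because it was tuned to the rows it was fitted on.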
  • 46. Model Comparison Summary
In this section we summarize our statistical analysis. As is evident from the above summary table, the Section 10 model is the best, with R² = 0.686157, Adj. R² = 0.685352 and RMSE = 0.717263. Attempts to further improve the model through stepwise regression and removal of non-significant variables did not improve the R² or RMSE, though there was some improvement in the LogWorth values & VIFs.
  • 47. References
1. Descriptive Statistics: https://socialresearchmethods.net/kb/statdesc.php
2. Hypothesis Testing: http://mathworld.wolfram.com/HypothesisTesting.html
3. Test concerning the mean of Normal Population: http://demonstrations.wolfram.com/ThePowerOfATestConcerningTheMeanOfANormalPopulation/
4. Assessing Normality: https://www.jmp.com/en_hk/learning-library/probabilities-and-distributions.html
5. Analysis Toolpak: https://support.office.com/en-us/article/use-the-analysis-toolpak-to-perform-complex-data-analysis-6c67ccf0-f4a9-487c-8dec-bdb5a2cefab6?NS=EXCEL&Version=90&SysLcid=1033&UiLcid=1033&AppVer=ZXL900&HelpId=xladdin.chm1786&ui=en-US&rs=en-US&ad=US
6. Correlation: https://www.jmp.com/content/dam/jmp/documents/en/academic/learning-library/05-correlation.pdf
7. RMSE & R²: https://www.quora.com/What-is-the-difference-between-RMSE-and-R-squared-in-statistics
8. ANOVA: http://onlinestatbook.com/2/analysis_of_variance/intro.html
9. LASSO: http://statweb.stanford.edu/~tibs/lasso.html
10. Market Basket Analysis: https://towardsdatascience.com/a-gentle-introduction-on-market-basket-analysis-association-rules-fa4b986a40ce
11. Market Basket Analysis: http://databoosting.com/boosting-revenue-market-basket-analysis/