MS-CAPSTONE
(BANA 6064)
CARL H. LINDNER COLLEGE OF BUSINESS
SUMMER 2016
PREDICTING CREDIT CARD DEFAULTS
Understanding the concept of default, why it happens and the components used to predict the
default of credit card holders
Submitted in Partial Fulfillment for the Requirements for the Degree of Master of Science in
Business Analytics
TO:
Prof. Yichen Qin (1st Reader)
Prof. Peng Wang (2nd Reader)
BY:
Sagar Vinaykumar Tupkar
tupkarsr@mail.uc.edu
M08773948
ABSTRACT
Credit card defaults pose a major problem for all major financial service providers today, as issuers must invest heavily in collection strategies whose outcomes are uncertain. Analysts in the financial industry have achieved considerable success in developing methods to predict the default of credit card holders based on various factors. This study uses the previous six months of a customer's data to predict whether the customer will default in the next month, applying various statistical and data mining techniques and building several models. Exploratory data analysis is also important for examining the distributions and patterns of customer behavior that eventually lead to default. Of the four models built, Logistic Regression after Principal Component Analysis and the Adaptive Boosting Classifier performed best, predicting defaults with around 83% accuracy while minimizing the penalty to the company. The study also produced a list of important variables that affect the model and should be considered when predicting defaults. Even though the accuracy of the predictions is good, further research and more powerful techniques can potentially enhance the results and transform the credit card industry.
Contents
ABSTRACT
1. INTRODUCTION
1.1. Credit-Card Default Definition
1.2. Background and Current Situation of Credit Card Defaults
2. OBJECTIVE OF THE STUDY
3. DATA
4. EXPLORATORY DATA ANALYSIS
4.1. Gender based Distribution
4.2. Education based Distribution
4.3. Age based Distribution
4.4. Marital Status based Distribution
4.5. Credit-Line based Distribution
4.6. Distribution of Payment Statistics in October 2015
4.7. Distribution of Payment Statistics in November 2015
4.8. Distribution of Payment Statistics in December 2015
4.9. Distribution of Payment Statistics in January 2016
4.10. Distribution of Payment Statistics in February 2016
4.11. Distribution of Payment Statistics in March 2016
5. MODEL PREPARATION
5.1. Logistic Regression Model
5.2. Classification Tree
5.3. Artificial Neural Network
5.4. Linear Discriminant Analysis
6. MODEL COMPARISON
7. CONCLUSION
8. REFERENCES
1. INTRODUCTION
1.1. Credit-Card Default Definition –
When a customer applies for and receives a credit card, it becomes a significant responsibility for both the customer and the issuing company. The credit card company evaluates the customer's creditworthiness and extends a line of credit it believes the customer can handle responsibly. While most people use their card to make purchases and then diligently pay off what they charge, some people, for one reason or another, fail to keep up with their payments and eventually go into credit card default.
Credit card default is the term for what happens when a credit card user makes purchases on the card and then does not pay the bill. It can occur when a payment is more than 30 days past due, which may raise the cardholder's interest rate. Most of the time, the term default is used informally when the credit card payment is more than 60 days past due. A default has a negative impact on the credit report and will most likely lead to higher interest rates on future borrowing.
1.2. Background and Current Situation of Credit Card Defaults –
The U.S. economy is growing at just 2.5% a year, but credit card lending is rising more than twice as fast: 5% over year-earlier levels each month since last fall, accelerating to 6% in March and April 2016, according to Federal Reserve data. That is the fastest card debt has grown since card lending fell in the 2009 recession, and since Americans aren't earning much more, won't delinquencies, charge-offs and bankruptcies rise in another year or two?
As a matter of fact, for the nine dominant U.S. credit card banks, which control 70% of the Visa-MasterCard-American Express-Discover-Chinapay market in the U.S., average charge-offs in early 2016 were 3.13% of annualized average loans, "down from a peak of 9.9% in 2009."
In recent years, credit card issuers have faced a cash and credit card debt crisis, having over-issued cash and credit cards to unqualified applicants in order to increase their market share. At the same time, many cardholders, irrespective of their repayment ability, overused credit cards for consumption and accumulated heavy credit and cash-card debts. The crisis is an omen of a blow to consumer finance confidence, and it is a big challenge for both banks and cardholders.
2. OBJECTIVE OF THE STUDY –
This project is an attempt to identify credit card customers who are more likely to default in the coming month. Many credit card issuing companies are working on predictive models that would help them anticipate a customer's payment status ahead of time using the customer's credit score, credit history, payment history and other factors. A lot of statistical models to predict delinquency already exist in the financial industry today; however, as the famous quote goes, "All models are wrong, but some are useful," and our attempt to build another predictive model makes its own small contribution to this research.
This project uses the customer's personal and financial information, such as credit line, age, and repayment and delinquency history for the past six months, to predict the probability that the particular customer will default next month. Several statistical and data mining techniques will be used to build a binary predictive model.
If credit card issuing companies can effectively predict the imminent default of customers beforehand, it will help them pursue the targeted customers and take calculated steps to avoid the default and contain future losses efficiently.
3. DATA
As mentioned earlier, this project uses the customer's personal and financial information, such as credit line, age, and repayment and delinquency history for the past six months, to predict the probability that a particular customer will default next month. The data was provided by a well-known U.S. credit card issuing bank and contains proprietary information about its customers (e.g., account numbers, which have been masked). The data does not in any way directly reveal the identity of any individual or provide information that could be decrypted to identify an individual.
The plan is to predict the probability that a credit card holder will default in the next month using payment data from October 2015 to March 2016. Among the 30,000 observations, 6,636 (22.12%) are cardholders with a default payment. With the binary variable default payment in April 2016 (Yes = 1, No = 0) as the response, the following 23 variables are used as explanatory variables:
1. X1: Amount of the given credit (USD): it includes both the individual consumer credit and his/her
family (supplementary) credit.
Including this variable in the study is important as the credit line of a customer is a good indicator
of the financial credit score of the customer. Using this variable will help the model predict
defaults more effectively.
2. X2: Gender (1 = male; 2 = female)
It might be useful to see whether the gender of the customer is in any way related to his/her
probability of default. The distribution of defaults based on gender will be an interesting chart to
look at.
3. X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others)
It might be useful to see whether the education level of the customer is in any way related to
his/her probability of default. The distribution of defaults based on education level will be an
interesting chart to look at.
4. X4: Marital status (1 = married; 2 = single; 3 = others)
It might be useful to see whether the marital status of the customer is in any way related to his/her
probability of default. The distribution of defaults based on marital status will be an interesting
chart to look at.
5. X5: Age (year)
It might be useful to see whether the age of the customer is in any way related to his/her
probability of default. The distribution of defaults based on age buckets will be an interesting
chart to look at.
6. X6 - X11: History of past payment. Customers’ past monthly payment records (from October 2015
to March, 2016) were tracked and used in the dataset as follows:
X6 = the repayment status in March, 2016;
X7 = the repayment status in February, 2016;
. . .;
X11 = the repayment status in October, 2015.
The measurement scale for the repayment status is:
-2 = Minimum due payment scheduled for 60 days
-1 = Minimum due payment scheduled for 30 days
0 = pay duly;
1 = payment delay for one month;
2 = payment delay for two months;
. . .;
8 = payment delay for eight months and above;
This information is very crucial as it directly provides the payment status of the customer for the
past 6 months. Using these variables will train the model efficiently to predict defaults.
7. X12-X17: Amount of bill statement (USD)
X12 = amount of bill statement in March, 2016;
X13 = amount of bill statement in February, 2016;
. . .;
X17 = amount of bill statement in October, 2015.
Actual bill statements of the customers for the past 6 months would give a quantitative estimate
for the amount spent by the customer using the credit card.
8. X18-X23: Amount of previous payment (USD)
X18 = amount paid in March, 2016;
X19 = amount paid in February, 2016;
. . .;
X23 = amount paid in October, 2015.
Amount of USD paid by the customers in past 6 months would give the repayment ability of the
customer and the pattern for payment could be used to train the model efficiently.
The variable names, example values, data types and descriptions are provided below in Figure 1.
Column name Variable Names Example Data Type Description
ID ID 23 Integer Masked Account numbers of Customers
Y default payment next month 1 Binary Binary variable (1,0) with 1 being customer defaults in next month
X1 LIMIT_BAL 2170 Continuous numeric Credit line of the customer
X2 SEX 2 Factor Gender of the customer 1= male 2=female
X3 EDUCATION 2 Factor Education (1 = graduate school; 2 = university; 3 = high school; 4 = others)
X4 MARRIAGE 2 Factor Marital status (1 = married; 2 = single; 3 = others)
X5 AGE 26 Integer Age (year)
X6 PAY_1 2 Factor repayment status in March, 2016
X7 PAY_2 0 Factor repayment status in February, 2016
X8 PAY_3 0 Factor repayment status in January, 2016
X9 PAY_4 2 Factor repayment status in December, 2015
X10 PAY_5 2 Factor repayment status in November, 2015
X11 PAY_6 2 Factor repayment status in October, 2015
X12 BILL_AMT1 1273.70 Continuous numeric amount of bill statement in March, 2016 (in USD)
X13 BILL_AMT2 1315.80 Continuous numeric amount of bill statement in February, 2016 (in USD)
X14 BILL_AMT3 1395.62 Continuous numeric amount of bill statement in January, 2016 (in USD)
X15 BILL_AMT4 1364.19 Continuous numeric amount of bill statement in December, 2015 (in USD)
X16 BILL_AMT5 1454.06 Continuous numeric amount of bill statement in November, 2015 (in USD)
X17 BILL_AMT6 1426.37 Continuous numeric amount of bill statement in October, 2015 (in USD)
X18 PAY_AMT1 62.22 Continuous numeric amount paid in March, 2016 (in USD)
X19 PAY_AMT2 111.04 Continuous numeric amount paid in February, 2016 (in USD)
X20 PAY_AMT3 0.00 Continuous numeric amount paid in January, 2016 (in USD)
X21 PAY_AMT4 111.63 Continuous numeric amount paid in December, 2015 (in USD)
X22 PAY_AMT5 0.00 Continuous numeric amount paid in November, 2015 (in USD)
X23 PAY_AMT6 56.42 Continuous numeric amount paid in October, 2015 (in USD)
Figure 1: Data Dictionary
To get a feel for the data, Figure 2 shows a snapshot of a subset of observations in the dataset, Figure 3 provides information such as the data type and levels of each variable, and Figure 4 provides a statistical summary of the data –
Figure 3: Description of variables in the dataset
Figure 2: Snapshot of the top 15 observations of the dataset
Figure 4: Summary of the variables in the dataset
4. EXPLORATORY DATA ANALYSIS
From the description of the data above, it can be concluded that the data does not have any null values for any of the variables. We will start with an initial exploratory data analysis, looking at the distribution of different variables for Y=0 and Y=1 so that the behavior of default and non-default customers can be analyzed.
The values of the variables were aggregated and plotted against the number of customers; that is, a frequency chart was prepared for the class variables and insights were drawn from the visualizations. The results are presented separately for Y=1 and Y=0, i.e. default and non-default customers (red being default and green being non-default), so that the analysis becomes easier.
4.1. Gender based Distribution:
A bar chart of the distribution of customers by gender is shown in Figure 5. The result is separated into default and non-default customers (red being default and green being non-default).
It can be observed that of the ~12k male credit card holders, 24.17% defaulted, whereas of the ~18k female credit card holders, 20.78% defaulted. Although there are more female customers than male customers, the proportion of defaulters is higher among male customers.
4.2. Education based Distribution:
A bar chart of the distribution of customers by education is shown in Figure 6. The result is separated into default and non-default customers (red being default and green being non-default).
Figure 6: Distribution by Education
Figure 5: Distribution by Gender
It can be observed that most of the credit card holders (~14k) are university educated, followed by graduate school and high school. Although the default percentage does not vary dramatically with education, it is worth noticing that 25.16% of the high school educated customers defaulted, while the rate decreases to 23.73% for university and 19.23% for graduate school.
4.3. Age based distribution:
The customers were grouped into age bins, and the resulting bar chart is shown in Figure 7. The result is separated into default and non-default customers (red being default and green being non-default).
Figure 7: Distribution by Age
It can be observed that credit card holders aged under 25 have the highest default proportion (~27.20%), followed by the 45-60 age group (25.13%). While customers in the 25-35 age group are the most numerous, their default proportion (~20.30%) is comparatively low.
4.4. Marital Status based Distribution:
A bar chart of the distribution of customers by marital status is shown in Figure 8. The result is separated into default and non-default customers (red being default and green being non-default).
Figure 8: Distribution by Marital Status
It can be observed that married credit card holders have a higher default proportion (~23.47%) than single customers (~20.93%). Although married customers are fewer in number than single customers, the default proportion is higher among married credit card holders.
4.5. Credit-Line based distribution:
The customers were grouped into credit-line bins, and the resulting bar chart is shown in Figure 9. The result is separated into default and non-default customers (red being default and green being non-default).
Figure 9: Distribution by Credit Line
It can be observed that customers whose credit line is between $1,000 and $5,000 are the most numerous and have the second-highest default proportion (24.46%), after customers with a credit line between $500 and $1,000, whose default proportion is alarmingly high (35.30%).
4.6. Distribution of Payment statistics in October 2015
For the month of October 2015, the credit card holders were distributed according to the repayment status codes described above, as shown in Figure 10. In addition, the total bill statement amount and the amount paid by the credit card holders were split between default and non-default customers to analyze their situations separately, as shown in Figure 11.
Figure 10: Repayment status distribution in Oct'15 Figure 11: Bill and Paid amount distribution in Oct'15
It can be observed that in October 2015, while most of the customers paid duly, almost 50% of the customers with a two-month payment delay went on to default in April 2016. On the other hand, customers who defaulted in April 2016 paid only ~9% of their bill statement in October 2015, as opposed to ~14% for non-default customers.
4.7. Distribution of Payment statistics in November 2015
For the month of November 2015, the credit card holders were distributed according to the repayment status codes described above, as shown in Figure 12. In addition, the total bill statement amount and the amount paid by the credit card holders were split between default and non-default customers to analyze their situations separately, as shown in Figure 13.
Figure 12: Repayment status distribution in Nov'15 Figure 13: Bill and Paid amount distribution in Nov'15
It can be observed that in November 2015, while most of the customers paid duly, almost 54% of the customers with a two-month payment delay went on to default in April 2016. On the other hand, customers who defaulted in April 2016 paid only ~8% of their bill statement in November 2015, as opposed to ~13% for non-default customers.
4.8. Distribution of Payment statistics in December 2015
For the month of December 2015, the credit card holders were distributed according to the repayment status codes described above, as shown in Figure 14. In addition, the total bill statement amount and the amount paid by the credit card holders were split between default and non-default customers to analyze their situations separately, as shown in Figure 15.
Figure 14: Repayment status distribution in Dec'15 Figure 15: Bill and Paid amount distribution in Dec'15
It can be observed that in December 2015, while most of the customers paid duly, almost 52% of the customers with a two-month payment delay went on to default in April 2016. On the other hand, customers who defaulted in April 2016 paid only ~7.5% of their bill statement in December 2015, as opposed to ~12% for non-default customers.
4.9. Distribution of Payment statistics in January 2016
For the month of January 2016, the credit card holders were distributed according to the repayment status codes described above, as shown in Figure 16. In addition, the total bill statement amount and the amount paid by the credit card holders were split between default and non-default customers to analyze their situations separately, as shown in Figure 17.
Figure 16: Repayment status distribution in Jan'16 Figure 17: Bill and Paid amount distribution in Jan'16
It can be observed that in January 2016, while most of the customers paid duly, almost 52% of the customers with a two-month payment delay went on to default in April 2016. On the other hand, customers who defaulted in April 2016 paid only ~7.5% of their bill statement in January 2016, as opposed to ~12% for non-default customers.
4.10. Distribution of Payment statistics in February 2016
For the month of February 2016, the credit card holders were distributed according to the repayment status codes described above, as shown in Figure 18. In addition, the total bill statement amount and the amount paid by the credit card holders were split between default and non-default customers to analyze their situations separately, as shown in Figure 19.
Figure 18: Repayment status distribution in Feb'16 Figure 19: Bill and Paid amount distribution in Feb'16
It can be observed that in February 2016, while most of the customers paid duly, almost 56% of the customers with a two-month payment delay went on to default in April 2016. On the other hand, customers who defaulted in April 2016 paid only ~7% of their bill statement in February 2016, as opposed to ~13% for non-default customers.
4.11. Distribution of Payment statistics in March 2016
For the month of March 2016, the credit card holders were distributed according to the repayment status codes described above, as shown in Figure 20. In addition, the total bill statement amount and the amount paid by the credit card holders were split between default and non-default customers to analyze their situations separately, as shown in Figure 21.
Figure 20: Repayment status distribution in Mar'16 Figure 21: Bill and Paid amount distribution in Mar'16
It can be observed that in March 2016, while most of the customers paid duly, almost 70% of the customers with a two-month payment delay and 34% of those with a one-month delay went on to default in April 2016. On the other hand, customers who defaulted in April 2016 paid only ~7% of their bill statement in March 2016, as opposed to ~12% for non-default customers.
5. MODEL PREPARATION
The aim of this exercise is to build a model, using the variables explained in the earlier section, to predict which credit card holders will default next month. The data used to train the model is the past six months of financial, delinquency and payment history. To build the models, the data was divided into training (80%) and testing (20%) subsets. Multiple classifiers were trained on the training dataset, which contained 24,000 observations, and compared on various model performance metrics. We will go through each model separately and discuss the scope, performance and pros and cons of every classifier method.
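As a concrete illustration of this split, the following is a minimal sketch in Python with pandas and scikit-learn (the report does not state which software was used, and the file name is a placeholder):

# Load the data of Figure 1 and make the 80/20 train/test split.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("credit_default.csv")   # hypothetical file name
X = df.drop(columns=["ID", "default payment next month"])
y = df["default payment next month"]

# 80% training (24,000 rows) / 20% testing (6,000 rows). Stratifying on y
# (an added assumption) keeps the 22.12% default rate similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)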
5.1. Logistic Regression Model
Logistic regression can be viewed as a generalization of the linear regression model: a binary response variable violates the normality assumptions of ordinary regression, so a logistic regression model instead specifies that an appropriate function (the logit) of the fitted probability of the event is a linear function of the observed values of the available explanatory variables. The major advantage of this approach is that it produces a simple probabilistic classification formula. The weakness is that LR cannot properly handle non-linear and interactive effects of explanatory variables.
A logistic regression model was fit on the training dataset using all the variables and the summary of the
model can be seen in Table 1 below –
Table 1: Logistic full model summary
Logistic Regression AIC Null Deviance Residual Deviance
Model Summary 20979 25314 20815
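As a hedged sketch of how such a fit and the summary statistics of Table 1 might be obtained (here with statsmodels; categorical variables are assumed to be dummy-coded already):

# Fit the full logistic model and read off AIC, null and residual deviance.
import statsmodels.api as sm

X_design = sm.add_constant(X_train)    # add the intercept term
logit_full = sm.GLM(y_train, X_design, family=sm.families.Binomial()).fit()

print(logit_full.aic)              # AIC               (Table 1: 20979)
print(logit_full.null_deviance)    # null deviance     (Table 1: 25314)
print(logit_full.deviance)         # residual deviance (Table 1: 20815)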
It was observed that some of the dummy variables that were created because of the presence of
class/nominal variables in the dataset were not significant in the above full logistic regression model and
hence a stepwise variable selection method for regression was performed.
5.1.1. Stepwise Variable Selection Method
By performing stepwise variable selection, it was observed that some of the variables were omitted
because they were insignificant in the full model. Finally, the new model was –
Y ~ X1 + X2 + X3 + X4 + X6 + X7 + X8 + X9 + X10 + X11 + X12 + X13 + X18 + X19 + X20 + X22
It can be seen that the variables X5, X14, X15, X16, X17, X21 and X23 were omitted from the model. The
summary of the new model can be seen in Table 2 below –
Table 2: Logistic stepwise model summary
Logistic Stepwise Regression AIC Null Deviance Residual Deviance
Model Summary 20970 25314 20820
The omitted variables, viz. age and the bill statement and payment amounts for a few of the months, are important from a business standpoint and are strongly recommended to be kept in any model put to use. Moreover, even after removing these variables, the AIC of the model did not decrease significantly.
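Stepwise search of this kind is typically AIC-based (as in R's step()); in scikit-learn, greedy sequential selection with cross-validated scoring is a rough analogue. A sketch, not the exact procedure used in the report:

# Greedy forward selection as a stand-in for AIC-based stepwise selection.
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select="auto", tol=1e-4,   # stop when AUC gains are tiny
    direction="forward", scoring="roc_auc", cv=5)
sfs.fit(X_train, y_train)
print(X_train.columns[sfs.get_support()])    # variables retained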
5.1.2. LASSO variable selection –
In statistics and machine learning, lasso (least absolute shrinkage and selection operator; also Lasso or LASSO) is a regression analysis method that performs both variable selection and regularization in order to avoid overfitting and enhance the prediction accuracy and interpretability of the statistical model it produces. In this project, 5-fold cross-validation was used to select the tuning parameter (lambda) that enters the LASSO optimization problem.
The entire dataset was used for 5-fold cross-validation and LASSO variable selection, and the binomial deviance was plotted against the tuning parameter lambda. The optimal value of lambda is indicated by the vertical line in the plot shown below; it is around 0.004.
Figure 22: Binomial Deviance plot to choose the tuning parameter-lambda
Now, using this lambda, variable selection was done with the LASSO method. The deviance of the resulting model was 31,704.85; because this is higher than that of the full logistic and stepwise models, this model was not used.
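A minimal sketch of such a cross-validated LASSO fit (scikit-learn parameterizes the penalty as C, roughly the inverse of lambda, so the scaling conventions differ from the plot above):

# 5-fold cross-validated L1 (LASSO) logistic regression; coefficients
# shrunk exactly to zero mark the variables that LASSO drops.
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

lasso_logit = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(penalty="l1", solver="saga", cv=5, Cs=20,
                         scoring="neg_log_loss", max_iter=5000))
lasso_logit.fit(X_train, y_train)
print(lasso_logit.named_steps["logisticregressioncv"].coef_)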
After comparing the three versions of the logistic model, it was concluded that the full logistic model has better parameter values than the other two. Hence, the full logistic regression model, rather than the reduced model or the LASSO fit, is used for further analysis.
To check the in-sample and out-of-sample performance of the model, the response variable was predicted using a cut-off probability of 0.2, in line with the value traditionally used by the company for default predictions. The ROC curves for the in-sample and out-of-sample predictions of the full logistic model are shown in Figure 23 and Figure 24 respectively.
Figure 23: ROC Curve for in-sample Logistic model predictions Figure 24: ROC Curve for out-sample Logistic model predictions
The in-sample and out-sample performance of the full logistic model is given in Table 3 below –
Table 3: Logistic full model in and out sample performance metrics
Model performance metric AUC False Positive Rate False Negative Rate Misclassification Rate
In-sample 0.7719 0.2038 0.3849 0.2437
Out-sample 0.7717 0.2016 0.3838 0.2425
The full logistic regression model predicts defaults with a 0.2437 error rate on the training dataset and a 0.2425 error rate on the test dataset. The AUC of the ROC curves for both in-sample and out-of-sample predictions is around 0.77. It can be concluded that logistic regression fits the data well and shows considerable predictive power.
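The cutoff-based metrics reported above can be computed for any fitted classifier that exposes predicted probabilities; a minimal sketch (model is a placeholder for a fitted classifier):

# Score predictions at the 0.2 probability cutoff and compute the metrics
# of Table 3: AUC, false positive rate, false negative rate, error rate.
from sklearn.metrics import confusion_matrix, roc_auc_score

def default_metrics(model, X, y, cutoff=0.2):
    p = model.predict_proba(X)[:, 1]           # P(default next month)
    yhat = (p >= cutoff).astype(int)
    tn, fp, fn, tp = confusion_matrix(y, yhat).ravel()
    return {"AUC": roc_auc_score(y, p),
            "FPR": fp / (fp + tn),
            "FNR": fn / (fn + tp),
            "Misclassification": (fp + fn) / len(y)}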
5.1.3. Principal Component Analysis
Principal components analysis is a procedure for identifying a smaller number of uncorrelated variables,
called "principal components", from a large set of data. The goal of principal components analysis is to
explain the maximum amount of variance with the fewest number of principal components. Principal
components analysis is commonly used in the social sciences, market research, and other industries that
use large data sets.
Principal components analysis is commonly used as one step in a series of analyses and can be used to
reduce the number of variables and avoid multicollinearity. The other main advantage of PCA is that once
the patterns in the data are found, it can be compressed, i.e. by reducing the number of dimensions,
without much loss of information.
To avoid the effect of multicollinearity in the predictions, the variables were standardized and
dimensionality reduction was applied to the dataset using principal component analysis (PCA). The
variance that was observed in the direction of various components was plotted and is showed in Figure
25 below –
Figure 25: Variance distribution for Principal Components
Principal component analysis produced 15 principal components that could explain the data almost as efficiently as the original variables. Moreover, the first few principal components account for a major portion of the total variance in the dataset. To choose the number of principal components, the 'Elbow Method' was used, and the corresponding line plot is shown in Figure 26 below –
Figure 26: Elbow curve for PCA to decide the number of PCs
It can be observed that beyond the 3rd or 4th principal component, the additional variance explained is not significant; hence only 4 principal components, which together account for more than 80% of the total variance, are kept for further analysis of the dataset.
The main purpose of applying PCA to the dataset was to reduce the effect of multicollinearity and decrease the number of dimensions in the hope of better model performance. Hence, a logistic regression was run on the reduced-dimension dataset and predictions were made with it. It was observed that after applying PCA, the logistic model trained more efficiently and its predictive power also improved.
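A minimal sketch of this standardize-project-fit pipeline:

# Standardize, keep the first 4 principal components (~80% of variance),
# then fit logistic regression on the reduced data.
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pca_logit = make_pipeline(StandardScaler(),
                          PCA(n_components=4),
                          LogisticRegression(max_iter=1000))
pca_logit.fit(X_train, y_train)
print(pca_logit.named_steps["pca"].explained_variance_ratio_.cumsum())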
The results of the confusion matrix are given in Table 4 below –
Table 4: Logistic model after PCA_ in and out sample performance metrics
Model performance metric False Positive Rate False Negative Rate Misclassification Rate
In-sample 0.1341 0.2862 0.1589
Out-sample 0.1653 0.2948 0.1732
It can be concluded that PCA reduced the dimensionality and the effect of multicollinearity, and hence the misclassification rate is lower than that of the ordinary logistic model with more dimensions. Although performance on the hold-out sample is not quite as good as on the training sample, the difference is not significant.
5.2. Classification Tree
In a classification tree structure, each internal node denotes a test on an attribute, each branch represents
an outcome of the test, and leaf nodes represent classes. The top-most node in a tree is the root node.
CTs are applied when the response variable is qualitative or quantitative discrete. Classification trees
perform a classification of the observations on the basis of all explanatory variables and supervised by the
presence of the response variable. The segmentation process is typically carried out using only one
explanatory variable at a time. CTs are based on minimizing impurity, which refers to a measure of
variability of the response values of the observations. CTs can result in simple classification rules and can
handle the nonlinear and interactive effects of explanatory variables. But their sequential nature and algorithmic complexity can make them dependent on the observed data, and even a small change might alter the structure of the tree. It is difficult to take a tree structure designed for one context and generalize it to other contexts.
A classification tree model was fit on the training dataset and the results were analyzed. The classification
tree is shown in Figure 27 below –
Figure 27: Classification Tree model diagram
5.2.1. Complexity Parameter tuning and pruning –
The tree obtained with the default complexity parameter “cp” =0.01 has 5 nodes as shown in Figure 27.
However, it is necessary to tune the complexity parameter according to the error change with addition
of every node. A plot for change in relative error is shown in Figure 28, which gives the optimal value of
“cp” and hence the size of the tree.
Figure 28: Relative error vs Complexity Parameter
As the value of “cp” increases, the complexity of the tree decreases. It can be seen that the relative error increases once the size of the tree exceeds 3 (“cp” = 0.05); hence it is not beneficial to grow the tree beyond 3 nodes. The tree shown in Figure 27 was pruned using a “cp” value of 0.05, based on the observation from Figure 28, and the final tree is shown in Figure 29 below –
Figure 29: Final Classification Tree after pruning
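As a sketch of the equivalent pruning step in scikit-learn, whose cost-complexity parameter ccp_alpha plays a role analogous to (but not numerically identical with) the “cp” used above; the alpha below is purely illustrative:

# Grow a tree, inspect the cost-complexity pruning path, then refit at a
# pruning level beyond which the error stops improving.
from sklearn.tree import DecisionTreeClassifier

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)
print(path.ccp_alphas[-10:])                  # largest candidate alphas

pruned_tree = DecisionTreeClassifier(ccp_alpha=0.005,   # illustrative value
                                     random_state=0)
pruned_tree.fit(X_train, y_train)
print(pruned_tree.get_n_leaves())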
The ROC curves for in-sample and out-sample predictions of Classification Tree model are shown in
Figure 30 and Figure 31 respectively.
Figure 30: ROC Curve for in-sample Classification Tree predictions Figure 31: ROC Curve for out-sample Classification Tree predictions
The in-sample and out-sample performance of the Classification Tree model is given in Table 5 below –
Table 5: Classification Tree model in and out sample performance metrics
Model performance metric AUC False Positive Rate False Negative Rate Misclassification Rate
In-sample 0.7284 0.2693 0.3288 0.2824
Out-sample 0.7304 0.2713 0.3378 0.2862
The Classification Tree model predicts defaults with a 0.2824 error rate on the training dataset and a 0.2862 error rate on the test dataset. The AUC of the in-sample ROC curve is 0.7284 and that of the out-of-sample curve is around 0.7304. It can be concluded that the Classification Tree fits the data well and shows considerable predictive power.
5.2.2. Adaptive Boosting (AdaBoost)
Boosting is a method that makes maximum use of a classifier by improving its accuracy. The classifier
method is used as a subroutine to build an extremely accurate classifier in the training set. Boosting
applies the classification system repeatedly on the training data, but in each step the learning attention is
focused on different examples of this set using adaptive weights. Once the process has finished, the single
classifiers obtained are combined into a final, highly accurate classifier in the training set. The final
classifier therefore usually achieves a high degree of accuracy in the test set, as various authors have
shown both theoretically and empirically. Out of the several versions of boosting algorithms, the best
known for binary classification problems is AdaBoost.
It is worth highlighting that the boosting procedure makes it possible to quantify the relative importance of the predictor variables. Understanding a single small tree can be easy; it is far more difficult to interpret the hundreds or thousands of trees used in a boosting ensemble. Being able to quantify each predictor's contribution to the discrimination is therefore a really important advantage. The measure of importance takes into account the gain in the Gini index given by a variable in a tree and the weight of that tree in the boosting ensemble.
In this project, the AdaBoost technique was applied to the dataset; after one hundred iterations with adaptive weights, it produced the relative importance of each variable in determining the binary outcome. The result is shown in Figure 32 below.
Figure 32: Relative importance of each variable in the classification task
It can be observed that the boosting algorithm assigned the maximum importance to the variable X6, the repayment status in March 2016. This is in concordance with the earlier plain tree structure and also makes sense, as a credit card holder's status next month depends heavily on the previous month's repayment status. However, not all results of the AdaBoost technique could be explained theoretically. For the final AdaBoost model, predictions were made for the training as well as the testing sample, and the results are given in Table 6 below –
Table 6: AdaBoosting Classification tree model_ in and out sample performance metrics
Model performance metric False Positive Rate False Negative Rate Misclassification Rate
In-sample 0.0842 0.4342 0.1893
Out-sample 0.1607 0.2891 0.1678
It can be clearly observed that model performance improved considerably compared to the plain classification tree when the AdaBoost technique was applied to the same dataset. As explained earlier, because of the adaptive reweighting this method uses, its performance improves on the testing dataset, and concordant effects are observed in the results shown in Table 6. Even though the False Positive Rate increased on the testing dataset, the more important metric, the False Negative Rate, decreased significantly, leading to an overall decrease in the error rate.
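A minimal sketch of this AdaBoost step and the variable-importance readout of Figure 32 (the depth of the base trees is an assumption; the report does not specify it):

# 100 boosting iterations over shallow trees, then the relative importance
# of each variable (scikit-learn >= 1.2 naming).
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=3),   # assumed base learner
    n_estimators=100, random_state=0)
ada.fit(X_train, y_train)

# Top variables by importance (cf. Figure 32, where X6 dominates).
for name, imp in sorted(zip(X_train.columns, ada.feature_importances_),
                        key=lambda t: -t[1])[:5]:
    print(f"{name}: {imp:.3f}")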
5.3. Artificial Neural Network
Artificial neural networks use non-linear mathematical equations to successively develop meaningful
relationships between input and output variables through a learning process. We applied back
propagation networks to classify data. A back propagation neural network uses a feed-forward topology
and supervised learning. The structure of back propagation networks is typically composed of an input
layer, one or more hidden layers, and an output layer, each consisting of several neurons. ANNs can easily handle the non-linear and interactive effects of explanatory variables. Their major drawback is that they do not yield a simple probabilistic classification formula.
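As a hedged sketch of such a back-propagation network (the single hidden layer and its size are assumptions; the report only states that 500 iterations were run):

# One-hidden-layer feed-forward network trained by back-propagation;
# inputs are standardized, since neural networks are scale-sensitive.
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

ann = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(10,),   # assumed architecture
                  max_iter=500, random_state=0))
ann.fit(X_train, y_train)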
An ANN black box model was fit to the training dataset and after 500 iterations, the model converged. As
this is a black box model, the details of the model cannot be shown here. However, the in-sample and
out-sample performance of the ANN model is given in Table 7 below –
Table 7: ANN model in and out sample performance metrics
Model performance metric False Positive Rate False Negative Rate Misclassification Rate
In-sample 0.2980 0.5191 0.3467
Out-sample 0.3499 0.4402 0.3702
The ANN model performs poorly on this dataset and especially on the hold-out sample with 0.37 error
rate.
5.4. Linear Discriminant Analysis
Discriminant analysis, also known as Fisher’s rule, is another technique applicable to a binary response variable. DA is an alternative to logistic regression and is based on the assumption that, for
each given class of response variable, the explanatory variables are distributed as a multivariate normal
distribution with a common variance–covariance matrix. The objective of Fisher’s rule is to maximize the
distance between different groups and to minimize the distance within each group. The pros and cons of
DA are similar to those of LR.
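A minimal sketch of an LDA fit scored at the 0.2 probability cutoff used throughout this report:

# Fit LDA and classify with the company's traditional 0.2 cutoff.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
p_default = lda.predict_proba(X_test)[:, 1]
yhat = (p_default >= 0.2).astype(int)         # 0.2 cutoff, not the usual 0.5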
Assuming the underlying explanatory variables are normally distributed, a discriminant analysis model was accordingly fit on the training dataset. To check the in-sample and out-of-sample performance, the response variable was predicted using a cut-off probability of 0.2, in line with the value traditionally used by the company for default predictions. The ROC curves for the in-sample and out-of-sample predictions of the LDA model are shown in Figure 33 and Figure 34 respectively.
Figure 33: ROC Curve for in-sample LDA model predictions Figure 34: ROC Curve for out-sample LDA model predictions
The in-sample and out-of-sample performance of the LDA model is given in Table 8 below –
Table 8: LDA model in and out sample performance metrics
Model performance metric AUC False Positive Rate False Negative Rate Misclassification Rate
In-sample 0.7723 0.1383 0.4549 0.2080
Out-sample 0.7697 0.1326 0.4461 0.2030
The linear discriminant analysis model predicts defaults with a 0.2080 error rate on the training dataset and a 0.2030 error rate on the test dataset. The AUC of the ROC curves for both in-sample and out-of-sample predictions is around 0.77. It can be concluded that LDA fits the data well and shows considerable predictive power.
6. MODEL COMPARISON
We have built various models on the training dataset and checked their performance on both the training and testing datasets in predicting defaults. For defaults, False Negatives hurt the business more than False Positives, so fewer False Negatives are desired in the predictions. To compare the performance of all the models, a cost function was introduced with five times as much penalty for a False Negative as for a False Positive. The comparison and model performance summary is given in Table 9 below –
Table 9: Comparison of in and out sample metrics of all models
Model AUC FPR FNR Error Rate Cost
1. Logistic Regression In-sample 0.7719 0.2038 0.3849 0.2437 0.58308
Out-sample 0.7717 0.2016 0.3838 0.2425 0.58726
1.1 Logistic after PCA In-sample NA 0.1341 0.2862 0.1589 0.4164
Out-sample NA 0.1653 0.2948 0.1732 0.4327
2. Classification Tree In-sample 0.7284 0.2693 0.3288 0.2824 0.57225
Out-sample 0.7304 0.2713 0.3378 0.2862 0.58959
2.1 AdaBoost Classifier In-sample NA 0.0842 0.4342 0.1893 0.5196
Out-sample NA 0.1607 0.2891 0.1678 0.4273
3. ANN In-sample NA 0.2980 0.5191 0.3467 0.80442
Out-sample NA 0.3499 0.4402 0.3702 0.76563
4. LDA In-sample 0.7723 0.1383 0.4549 0.2080 0.66475
Out-sample 0.7697 0.1326 0.4461 0.2030 0.60377
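The cost column can be reproduced, at least approximately, from the rates above; a minimal sketch, assuming cost = (1·FP + 5·FN) / n, a weighting that closely matches the reported values (e.g. roughly 0.58 for the full logistic model):

# Asymmetric cost per observation: a false negative costs five times a
# false positive, per the penalty described in this section.
from sklearn.metrics import confusion_matrix

def expected_cost(y_true, y_pred, fn_weight=5.0, fp_weight=1.0):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return (fp_weight * fp + fn_weight * fn) / len(y_true)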
It can be observed that although some methods (e.g. LDA) have a lower misclassification rate, their high False Negative rate combined with the asymmetric cost function of the business problem makes their overall cost high. The ANN model performs poorly on both the in-sample and out-of-sample datasets, with a very high cost value. The logit model and the Classification Tree have almost the same cost. However, after applying PCA and using the reduced-dimension data for logistic regression, the results are much better than for ordinary logistic regression; similarly, AdaBoost improves the performance of the Classification Tree through its adaptive boosting technique. The error rate of logistic regression with PCA is comparable to that of AdaBoost, but its False Negative rate is lower on the training dataset, making it the better model. Depending on business requirements, either the Logistic Model (after PCA) or the AdaBoost Classifier could be used to predict which credit card holders will default next month.
7. CONCLUSION
This exercise has enabled us to predict which customers are likely to default next month using the past six months of data for each credit card holder, viz. payment and default history. Various classifiers were built for the problem statement and compared on the basis of their False Positive and False Negative rates. It was observed that the Logistic Regression model (after PCA) and the AdaBoost Classifier perform best among all the models and hence could be accepted.
Our models might perform even better with a few additional variables for which data was not readily available; for example, credit score and income would likely train the model better than what we have now. That said, using only the past six months of data to predict defaults in the immediate future is not the best way to solve the problem; many data mining and machine learning techniques now booming in the financial industry could further improve the prediction of credit card customer defaults.
8. REFERENCES –
1. http://www.goodfinancialcents.com/credit-card-default-debt-consequences-results/
2. http://www.bestcreditcardrates.com
3. http://www.philly.com/philly/blogs/inq-phillydeals/US-credit-card-borrowing-surges-more-defaults-soon.html
4. https://www.federalreserve.gov/econresdata/
5. Banfield RE, Hall LO, Bowyer KW, Kegelmeyer WP (2007). “A Comparison of Decision Tree
Ensemble Creation Techniques.” IEEE Transactions on Pattern Analysis and Machine Intelligence,
29(1), 173–180.
6. https://www.jstatsoft.org/index
7. http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/
Examining impacts of big data analytics on consumer finance a case of china
 
SOLUTIONS FOR ANALYTICS POWERED BANKING
SOLUTIONS FOR ANALYTICS POWERED BANKINGSOLUTIONS FOR ANALYTICS POWERED BANKING
SOLUTIONS FOR ANALYTICS POWERED BANKING
 
Broadening Bill Payment Adoption With Photo-Based Payment: Quantifying the Be...
Broadening Bill Payment Adoption With Photo-Based Payment: Quantifying the Be...Broadening Bill Payment Adoption With Photo-Based Payment: Quantifying the Be...
Broadening Bill Payment Adoption With Photo-Based Payment: Quantifying the Be...
 
P2P Lending Business Research by Artivatic.ai
P2P Lending Business Research by Artivatic.aiP2P Lending Business Research by Artivatic.ai
P2P Lending Business Research by Artivatic.ai
 
Predicting Delinquency-Give me some credit
Predicting Delinquency-Give me some creditPredicting Delinquency-Give me some credit
Predicting Delinquency-Give me some credit
 
Data Science Use Cases in The Banking and Finance Sector
Data Science Use Cases in The Banking and Finance SectorData Science Use Cases in The Banking and Finance Sector
Data Science Use Cases in The Banking and Finance Sector
 

Recently uploaded

06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTESAdjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Subhajit Sahu
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
u86oixdj
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
jerlynmaetalle
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
2023240532
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 

Recently uploaded (20)

06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTESAdjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
Influence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business PlanInfluence of Marketing Strategy and Market Competition on Business Plan
Influence of Marketing Strategy and Market Competition on Business Plan
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 

Predicting Credit Card Defaults using Machine Learning Algorithms

4.8. Distribution of Payment statistics in December 2015
4.9. Distribution of Payment statistics in January 2016
4.10. Distribution of Payment statistics in February 2016
4.11. Distribution of Payment statistics in March 2016
5. MODEL PREPARATION
5.1. Logistic Regression Model
5.2. Classification Tree
5.3. Artificial Neural Network
5.4. Linear Discriminant Analysis
6. MODEL COMPARISON
7. CONCLUSION
8. REFERENCES
1. INTRODUCTION

1.1. Credit-Card Default Definition –

When a customer applies for and receives a credit card, it becomes a major responsibility for the customer as well as for the credit card issuing company. The issuer evaluates the customer's creditworthiness and grants a line of credit it believes the customer can handle responsibly. While most people use their card to make purchases and then diligently pay off what they charge, some people, for one reason or another, do not keep up with their payments and eventually go into credit card default.

Credit card default is the term used to describe what happens when a credit card user makes purchases on the card and then does not pay the bill. It can occur when a payment is more than 30 days past due, which may already raise the cardholder's interest rate; most of the time, however, the term is used informally when the payment is more than 60 days past due. A default has a negative impact on the credit report and will most likely lead to higher interest rates on future borrowing.

1.2. Background and Current Situation of Credit Card Defaults –

The U.S. economy is growing at just 2.5% a year, but credit card lending is rising more than twice as fast: 5% over year-earlier levels each month since last fall, accelerating to 6% in March and April 2016, according to Federal Reserve data. That is the fastest card debt has grown since card lending fell in the 2009 recession, and since Americans are not earning that much more, delinquencies, charge-offs and bankruptcies may well rise in another year or two. As a matter of fact, for the nine dominant U.S. credit card banks, which control 70% of the Visa-MasterCard-American Express-Discover-Chinapay market in the U.S., average charge-offs in early 2016 were 3.13% of annualized average loans, down from a peak of 9.9% in 2009.

In recent years, credit card issuers have been facing a cash and credit card debt crisis, having over-issued cash and credit cards to unqualified applicants in order to increase their market share. At the same time, many cardholders, irrespective of their repayment ability, overused their credit cards for consumption and accumulated heavy credit and cash-card debts. The crisis is an omen of a blow to consumer finance confidence, and it is a big challenge for both banks and cardholders.

2. OBJECTIVE OF THE STUDY –

This project is an attempt to identify credit card customers who are more likely to default in the coming month. Many credit card issuing companies are working on predictive models that would help them predict the payment status of a customer ahead of time using the customer's credit score, credit history, payment history and other factors. A lot of statistical models to predict delinquency are extant in the
financial industry today; however, as the famous quote goes, "All models are wrong, but some are useful," so our attempt to build another predictive model makes its own small contribution to the research. This project uses the customer's personal and financial information, such as credit line, age, and repayment and delinquency history for the past six months, to predict the probability that a particular customer will default next month. Several statistical and data mining techniques will be used to build a binary predictive model. If credit card issuing companies can effectively predict the imminent default of customers beforehand, it will help them pursue targeted customers and take calculated steps to avoid the default and limit future losses efficiently.

3. DATA

As mentioned earlier, this project uses the customer's personal and financial information, such as credit line, age, and repayment and delinquency history for the past six months, to predict the probability that a particular customer will default next month. The data was provided by one of the well-known credit card issuing banks of the USA and contains proprietary information about the customers (e.g. account numbers, which have been masked). The data in no way directly reveals the identity of any individual or provides information that could be decrypted to connect to an individual.

The plan is to predict the probability of credit-card holders defaulting in the next month using payment data from October 2015 to March 2016. Among the total 30,000 observations, 6,636 (22.12%) are cardholders with a default payment. With the binary variable default payment in April 2016 (Yes = 1, No = 0) as the response variable, the following 23 variables are used as explanatory variables:

1. X1: Amount of the given credit (USD). It includes both the individual consumer credit and his/her family (supplementary) credit. Including this variable is important, as the credit line of a customer is a good indicator of the customer's financial credit standing, and it should help the model predict defaults more effectively.

2. X2: Gender (1 = male; 2 = female). It may be useful to see whether the gender of the customer is in any way related to his/her probability of default; the distribution of defaults by gender will be examined.

3. X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others). It may be useful to see whether the education level of the customer is related to his/her probability of default; the distribution of defaults by education level will be examined.
4. X4: Marital status (1 = married; 2 = single; 3 = others). It may be useful to see whether the marital status of the customer is related to his/her probability of default; the distribution of defaults by marital status will be examined.

5. X5: Age (year). It may be useful to see whether the age of the customer is related to his/her probability of default; the distribution of defaults across age buckets will be examined.

6. X6 - X11: History of past payment. The customers' past monthly payment records (from October 2015 to March 2016) were tracked and included in the dataset as follows: X6 = the repayment status in March 2016; X7 = the repayment status in February 2016; ...; X11 = the repayment status in October 2015. The measurement scale for the repayment status is:

-2 = minimum due payment scheduled for 60 days
-1 = minimum due payment scheduled for 30 days
0 = paid duly
1 = payment delay of one month
2 = payment delay of two months
...
8 = payment delay of eight months or more

This information is crucial, as it directly provides the payment status of the customer for the past six months; these variables should train the model to predict defaults efficiently.

7. X12 - X17: Amount of bill statement (USD). X12 = amount of bill statement in March 2016; X13 = amount of bill statement in February 2016; ...; X17 = amount of bill statement in October 2015.
The actual bill statements of the customers for the past six months give a quantitative estimate of the amount spent by the customer using the credit card.

8. X18 - X23: Amount of previous payment (USD). X18 = amount paid in March 2016; X19 = amount paid in February 2016; ...; X23 = amount paid in October 2015. The amounts paid by the customer over the past six months indicate the customer's repayment ability, and the payment pattern can be used to train the model efficiently.

The variable names, an example value, the data type and a description of each variable are provided in Figure 1 below.

Column | Variable Name | Example | Data Type | Description
ID | ID | 23 | Integer | Masked account number of the customer
Y | default payment next month | 1 | Binary | 1 if the customer defaults next month, 0 otherwise
X1 | LIMIT_BAL | 2170 | Continuous numeric | Credit line of the customer
X2 | SEX | 2 | Factor | Gender (1 = male; 2 = female)
X3 | EDUCATION | 2 | Factor | Education (1 = graduate school; 2 = university; 3 = high school; 4 = others)
X4 | MARRIAGE | 2 | Factor | Marital status (1 = married; 2 = single; 3 = others)
X5 | AGE | 26 | Integer | Age (years)
X6 | PAY_1 | 2 | Factor | Repayment status in March 2016
X7 | PAY_2 | 0 | Factor | Repayment status in February 2016
X8 | PAY_3 | 0 | Factor | Repayment status in January 2016
X9 | PAY_4 | 2 | Factor | Repayment status in December 2015
X10 | PAY_5 | 2 | Factor | Repayment status in November 2015
X11 | PAY_6 | 2 | Factor | Repayment status in October 2015
X12 | BILL_AMT1 | 1273.70 | Continuous numeric | Bill statement amount in March 2016 (USD)
X13 | BILL_AMT2 | 1315.80 | Continuous numeric | Bill statement amount in February 2016 (USD)
X14 | BILL_AMT3 | 1395.62 | Continuous numeric | Bill statement amount in January 2016 (USD)
X15 | BILL_AMT4 | 1364.19 | Continuous numeric | Bill statement amount in December 2015 (USD)
X16 | BILL_AMT5 | 1454.06 | Continuous numeric | Bill statement amount in November 2015 (USD)
X17 | BILL_AMT6 | 1426.37 | Continuous numeric | Bill statement amount in October 2015 (USD)
X18 | PAY_AMT1 | 62.22 | Continuous numeric | Amount paid in March 2016 (USD)
X19 | PAY_AMT2 | 111.04 | Continuous numeric | Amount paid in February 2016 (USD)
X20 | PAY_AMT3 | 0.00 | Continuous numeric | Amount paid in January 2016 (USD)
X21 | PAY_AMT4 | 111.63 | Continuous numeric | Amount paid in December 2015 (USD)
X22 | PAY_AMT5 | 0.00 | Continuous numeric | Amount paid in November 2015 (USD)
X23 | PAY_AMT6 | 56.42 | Continuous numeric | Amount paid in October 2015 (USD)

Figure 1: Data Dictionary
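The report does not show the code behind the data inspection that follows; here is a minimal Python sketch of loading and summarizing the data, assuming it has been exported to a CSV file with the column names of Figure 1 (the file name credit_default.csv is hypothetical, and the report does not name its software).

import pandas as pd

# Hypothetical file name; the underlying data is proprietary and masked.
df = pd.read_csv("credit_default.csv")
target = "default payment next month"   # Y in the data dictionary

print(df.shape)           # expected (30000, 25): ID, Y and X1-X23
print(df.head(15))        # snapshot of the top 15 observations (cf. Figure 2)
print(df.dtypes)          # data types of the variables (cf. Figure 3)
print(df.describe())      # statistical summary (cf. Figure 4)
print(df[target].mean())  # default rate, ~0.2212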
To get a feel for the data, Figure 2 shows a snapshot of a subset of observations in the dataset, Figure 3 provides information such as the data type and levels of each variable, and Figure 4 provides a statistical summary of the data.

Figure 2: Snapshot of the top 15 observations of the dataset
Figure 3: Description of variables in the dataset
Figure 4: Summary of the variables in the dataset

4. EXPLORATORY DATA ANALYSIS

From the description of the data above, it can be concluded that the data has no null values for any of the variables. The exploratory data analysis starts by looking at the distribution of each variable for Y = 0 and Y = 1, so that the behavior of default and non-default customers can be compared. The values of the variables were aggregated and the totals plotted against the number of customers, producing frequency charts for the class variables from which insights were drawn. The results are presented separately for default and non-default customers (red being default and green being non-default) to make the analysis easier.
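As an illustration of how such split frequency charts can be produced, here is a minimal sketch using pandas and matplotlib, reusing the df and target names from the sketch above; the helper name default_distribution is ours, not the report's.

import matplotlib.pyplot as plt

def default_distribution(df, col, target="default payment next month"):
    # Frequency of each level of `col`, split by default status (cf. Figures 5-9)
    counts = df.groupby([col, target]).size().unstack(fill_value=0)
    counts.columns = ["non-default", "default"]
    counts.plot(kind="bar", color=["green", "red"], title=f"Distribution by {col}")
    plt.show()
    # Default proportion within each level of `col`
    return counts["default"] / counts.sum(axis=1)

print(default_distribution(df, "SEX"))  # ~24.17% for males (1), ~20.78% for females (2)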
4.1. Gender based Distribution:

A bar chart of the distribution of customers by gender is shown in Figure 5, separated into default and non-default customers (red being default and green being non-default). Out of the roughly 12,000 male credit card holders, 24.17% defaulted, whereas out of the roughly 18,000 female credit card holders, 20.78% defaulted. Although there are more female customers than male customers in total, the percentage of defaulters is higher among male customers.

Figure 5: Distribution by Gender

4.2. Education based Distribution:

A bar chart of the distribution of customers by education is shown in Figure 6, again separated into default and non-default customers (red being default and green being non-default).

Figure 6: Distribution by Education
Most of the credit card holders (~14,000) hold university degrees, followed by graduate school and high school graduates. Although the default percentage does not vary dramatically with education, it is worth noticing that 25.16% of the high school graduates defaulted, while the figure decreases to 23.73% for university and 19.23% for graduate school customers.

4.3. Age based distribution:

The customers were grouped into age bins, and the resulting bar chart is shown in Figure 7, separated into default and non-default customers (red being default and green being non-default).

Figure 7: Distribution by Age

Credit card holders aged under 25 years have the highest default proportion (~27.20%), followed by the 45-60 age group (25.13%). While the 25-35 age group contains the most customers, its default proportion (~20.30%) is comparatively low.

4.4. Marital Status based Distribution:

A bar chart of the distribution of customers by marital status is shown in Figure 8, separated into default and non-default customers (red being default and green being non-default).
Figure 8: Distribution by Marital Status

Married credit card holders have the higher default proportion (~23.47%), compared with ~20.93% for single customers. Although married customers are fewer in number than single customers, the default proportion is higher among the married credit card holders.

4.5. Credit-Line based distribution:

The customers were grouped into credit-line bins, and the resulting bar chart is shown in Figure 9, separated into default and non-default customers (red being default and green being non-default).

Figure 9: Distribution by Credit Line
Customers with a credit line between $1,000 and $5,000 are the most numerous and have the second-highest default proportion (24.46%), after customers with a credit line between $500 and $1,000, whose default proportion is alarmingly high (35.30%).

4.6. Distribution of Payment statistics in October 2015

For October 2015, the credit card holders were distributed according to the repayment status codes defined above, as shown in Figure 10. In addition, the total bill statement amount and the total amount paid were split between default and non-default customers, to analyze the two groups separately, as shown in Figure 11.

Figure 10: Repayment status distribution in Oct'15
Figure 11: Bill and Paid amount distribution in Oct'15

In October 2015, while most customers paid duly, almost 50% of the customers with a two-month payment delay went on to default in April 2016. Moreover, customers who defaulted in April 2016 paid only ~9% of their bill statement in October 2015, as opposed to ~14% for non-default customers.

4.7. Distribution of Payment statistics in November 2015

For November 2015, the credit card holders were again distributed according to their repayment status codes, as shown in Figure 12, and the bill and paid amounts split by default status are shown in Figure 13.
Figure 12: Repayment status distribution in Nov'15
Figure 13: Bill and Paid amount distribution in Nov'15

In November 2015, while most customers paid duly, almost 54% of the customers with a two-month payment delay went on to default in April 2016. Customers who defaulted in April 2016 paid only ~8% of their bill statement in November 2015, as opposed to ~13% for non-default customers.

4.8. Distribution of Payment statistics in December 2015

For December 2015, the credit card holders were distributed according to their repayment status codes, as shown in Figure 14, and the bill and paid amounts split by default status are shown in Figure 15.

Figure 14: Repayment status distribution in Dec'15
Figure 15: Bill and Paid amount distribution in Dec'15
In December 2015, while most customers paid duly, almost 52% of the customers with a two-month payment delay went on to default in April 2016. Customers who defaulted in April 2016 paid only ~7.5% of their bill statement in December 2015, as opposed to ~12% for non-default customers.

4.9. Distribution of Payment statistics in January 2016

For January 2016, the credit card holders were distributed according to their repayment status codes, as shown in Figure 16, and the bill and paid amounts split by default status are shown in Figure 17.

Figure 16: Repayment status distribution in Jan'16
Figure 17: Bill and Paid amount distribution in Jan'16

In January 2016, while most customers paid duly, almost 52% of the customers with a two-month payment delay went on to default in April 2016. Customers who defaulted in April 2016 paid only ~7.5% of their bill statement in January 2016, as opposed to ~12% for non-default customers.

4.10. Distribution of Payment statistics in February 2016

For February 2016, the credit card holders were distributed according to their repayment status codes, as shown in Figure 18, and the bill and paid amounts split by default status are shown in Figure 19.
Figure 18: Repayment status distribution in Feb'16
Figure 19: Bill and Paid amount distribution in Feb'16

In February 2016, while most customers paid duly, almost 56% of the customers with a two-month payment delay went on to default in April 2016. Customers who defaulted in April 2016 paid only ~7% of their bill statement in February 2016, as opposed to ~13% for non-default customers.

4.11. Distribution of Payment statistics in March 2016

For March 2016, the credit card holders were distributed according to their repayment status codes, as shown in Figure 20, and the bill and paid amounts split by default status are shown in Figure 21.

Figure 20: Repayment status distribution in Mar'16
Figure 21: Bill and Paid amount distribution in Mar'16
In March 2016, while most customers paid duly, almost 70% of the customers with a two-month payment delay and 34% of the customers with a one-month payment delay went on to default in April 2016. Customers who defaulted in April 2016 paid only ~7% of their bill statement in March 2016, as opposed to ~12% for non-default customers.

5. MODEL PREPARATION

The aim of this exercise is to build a model, using the variables explained in the earlier section, that predicts which credit card holders will default next month. The model is trained on the past six months of financial, delinquency and payment history. For this purpose, the data was divided into training (80%) and testing (20%) subsets. Multiple classifiers were built on the training dataset of 24,000 observations and compared on various model performance metrics. Each classifier is discussed separately below, along with its scope, performance, and pros and cons.

5.1. Logistic Regression Model

Logistic regression can be considered a special case of the linear regression model; however, a binary response variable violates the normality assumptions of general regression models. A logistic regression model specifies that an appropriate function of the fitted probability of the event is a linear function of the observed values of the available explanatory variables. The major advantage of this approach is that it produces a simple probabilistic formula for classification; its weakness is that it cannot properly handle non-linear and interactive effects of the explanatory variables.

A logistic regression model was fit on the training dataset using all the variables; its summary is given in Table 1 below.

Table 1: Logistic full model summary
AIC | Null Deviance | Residual Deviance
20979 | 25314 | 20815

Some of the dummy variables created from the class/nominal variables in the dataset were not significant in the full model, so a stepwise variable selection was performed.

5.1.1. Stepwise Variable Selection Method

Stepwise selection omitted the variables that were insignificant in the full model, giving the reduced model:

Y ~ X1 + X2 + X3 + X4 + X6 + X7 + X8 + X9 + X10 + X11 + X12 + X13 + X18 + X19 + X20 + X22
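The report does not show the fitting code. Below is a minimal sketch of the 80/20 split and the full logistic fit, using Python's statsmodels and scikit-learn; the dummy coding of the factor variables and the random seed are assumptions, so the numbers will not match Table 1 exactly.

import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

factors = ["SEX", "EDUCATION", "MARRIAGE",
           "PAY_1", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]
X = pd.get_dummies(df.drop(columns=["ID", target]),
                   columns=factors, drop_first=True).astype(float)
y = df[target]

# 80% training / 20% testing split, as described above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Full logistic model; AIC, null deviance and residual deviance as in Table 1
full = sm.GLM(y_train, sm.add_constant(X_train),
              family=sm.families.Binomial()).fit()
print(full.aic, full.null_deviance, full.deviance)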
In the reduced model, the variables X5, X14, X15, X16, X17, X21 and X23 were omitted. The summary of the new model is given in Table 2 below.

Table 2: Logistic stepwise model summary
AIC | Null Deviance | Residual Deviance
20970 | 25314 | 20820

The omitted variables, namely age and the bill statement and payment amounts of several months, are important from a business standpoint and are strongly recommended to be kept in the model. Moreover, removing them improved the AIC only marginally (from 20979 to 20970).

5.1.2. LASSO variable selection –

In statistics and machine learning, the lasso (least absolute shrinkage and selection operator) is a regression analysis method that performs both variable selection and regularization, in order to avoid overfitting and enhance the prediction accuracy and interpretability of the statistical model it produces. In this project, 5-fold cross-validation was used to select the tuning parameter (lambda) that enters the LASSO optimization problem. The entire dataset was used for the 5-fold cross-validation, and the binomial deviance was plotted against the tuning parameter; the optimal value, marked by the vertical line in Figure 22, is around lambda = 0.004.

Figure 22: Binomial deviance plot used to choose the tuning parameter lambda

Variable selection was then carried out with this lambda using the LASSO. The null deviance of the resulting model, however, was 31704.85; because this value is higher than for the full logistic and stepwise models, this model was not used.
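A sketch of the 5-fold cross-validated LASSO described above, using scikit-learn; note that scikit-learn parameterises the penalty as C, which (up to scaling conventions) is the inverse of the lambda above, and the search grid is an assumption.

import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train)
lasso = LogisticRegressionCV(Cs=np.logspace(-4, 4, 40), cv=5,
                             penalty="l1", solver="saga",
                             scoring="neg_log_loss", max_iter=5000)
lasso.fit(scaler.transform(X_train), y_train)

print(1 / lasso.C_[0])                        # selected penalty strength
print(int((lasso.coef_ != 0).sum()), "coefficients kept by the LASSO")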
After comparing the three versions of the logistic model, it was concluded that the full logistic model has better parameter values than the other two; hence the full logistic regression model, rather than the reduced model or the LASSO fit, is used for further analysis. To check the in-sample and out-of-sample performance of the model, the response variable was predicted using a cut-off probability of 0.2, the traditional value used by the company for default predictions. The ROC curves for the in-sample and out-of-sample predictions of the full logistic model are shown in Figures 23 and 24, respectively.

Figure 23: ROC curve for in-sample logistic model predictions
Figure 24: ROC curve for out-of-sample logistic model predictions

The in-sample and out-of-sample performance of the full logistic model is given in Table 3 below.

Table 3: Logistic full model in- and out-of-sample performance metrics
Sample | AUC | False Positive Rate | False Negative Rate | Misclassification Rate
In-sample | 0.7719 | 0.2038 | 0.3849 | 0.2437
Out-of-sample | 0.7717 | 0.2016 | 0.3838 | 0.2425

The full logistic regression model predicts defaults with a 0.2437 error rate on the training dataset and a 0.2425 error rate on the test dataset, and the AUC for both the in-sample and out-of-sample ROC curves is around 0.77. It can be concluded that logistic regression fits the data well and shows considerable predictive power.
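The metrics of Table 3 can be computed along these lines, continuing the statsmodels sketch above; the helper name classification_metrics is ours.

from sklearn.metrics import confusion_matrix, roc_auc_score

def classification_metrics(probs, y_true, cutoff=0.2):
    # The company's traditional 0.2 cut-off probability for default predictions
    y_pred = (probs >= cutoff).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {"AUC": roc_auc_score(y_true, probs),
            "FPR": fp / (fp + tn),                    # false positive rate
            "FNR": fn / (fn + tp),                    # false negative rate
            "Misclassification": (fp + fn) / len(y_true)}

print(classification_metrics(full.predict(sm.add_constant(X_test)), y_test))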
5.1.3. Principal Component Analysis

Principal components analysis (PCA) is a procedure for identifying a smaller number of uncorrelated variables, called principal components, from a large set of data. The goal is to explain the maximum amount of variance with the fewest principal components. PCA is commonly used in the social sciences, market research and other fields that work with large datasets, often as one step in a series of analyses, to reduce the number of variables and avoid multicollinearity. Its other main advantage is that once the patterns in the data are found, the data can be compressed by reducing the number of dimensions without much loss of information.

To avoid the effect of multicollinearity on the predictions, the variables were standardized and dimensionality reduction was applied to the dataset using PCA. The variance observed in the direction of each component is plotted in Figure 25 below.

Figure 25: Variance distribution for principal components
The principal component analysis produced 15 principal components that explain the data almost as efficiently as the original variables. However, the first few components account for the major share of the total variance in the dataset. To choose the number of components, the elbow method was used; the corresponding line plot is shown in Figure 26 below.

Figure 26: Elbow curve for PCA to decide the number of PCs

After the third or fourth principal component, the additional variance explained is not significant; hence only 4 principal components, which account for a cumulative total of more than 80% of the variance, are kept for further analysis. The main purpose of applying PCA was to reduce the effect of multicollinearity and decrease the number of dimensions so that model performance might improve. A logistic regression was therefore run on the new, reduced-dimension dataset, and predictions were made with it. After applying PCA, the logistic model trained more efficiently and its predictive power also improved. The results of the confusion matrix are given in Table 4 below.

Table 4: Logistic model after PCA, in- and out-of-sample performance metrics
Sample | False Positive Rate | False Negative Rate | Misclassification Rate
In-sample | 0.1341 | 0.2862 | 0.1589
Out-of-sample | 0.1653 | 0.2948 | 0.1732

It can be concluded that PCA reduced the dimensions and the effect of multicollinearity, and hence the misclassification rate is lower than for the ordinary logistic model with more dimensions. Although performance on the hold-out sample is not as good as on the training sample, the difference is not significant.
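A sketch of the PCA step, continuing the sketches above: standardize, keep four components, then fit the logistic regression on the component scores (scikit-learn assumed; solver settings are defaults).

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pca_logit = make_pipeline(StandardScaler(),
                          PCA(n_components=4),        # elbow in Figure 26
                          LogisticRegression(max_iter=1000))
pca_logit.fit(X_train, y_train)

# Cumulative variance explained by the four components (> 80% per the text)
print(pca_logit.named_steps["pca"].explained_variance_ratio_.cumsum())
print(classification_metrics(pca_logit.predict_proba(X_test)[:, 1], y_test))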
5.2. Classification Tree

In a classification tree, each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class; the top-most node is the root node. Classification trees (CTs) are applied when the response variable is qualitative or quantitative discrete. They classify the observations on the basis of all explanatory variables, supervised by the presence of the response variable, with the segmentation typically carried out using one explanatory variable at a time. CTs are based on minimizing impurity, a measure of the variability of the response values of the observations. They can produce simple classification rules and can handle non-linear and interactive effects of explanatory variables, but their sequential nature and algorithmic complexity make them dependent on the observed data: even a small change might alter the structure of the tree, and it is difficult to take a tree structure designed for one context and generalize it to other contexts.

A classification tree model was fit on the training dataset and the results were analyzed. The tree is shown in Figure 27 below.

Figure 27: Classification tree model diagram

5.2.1. Complexity Parameter tuning and pruning –

The tree obtained with the default complexity parameter cp = 0.01 has 5 nodes, as shown in Figure 27. However, the complexity parameter must be tuned according to how the error changes with the addition of each node. The change in relative error is plotted in Figure 28, which gives the optimal value of cp and hence the size of the tree.
Figure 28: Relative error vs complexity parameter

As the value of cp increases, the complexity of the tree decreases. The relative error increases once the size of the tree exceeds 3 (cp = 0.05), so it is not beneficial to grow the tree beyond that size. The tree shown in Figure 27 was pruned using a cp value of 0.05, based on the observation from Figure 28, and the final tree is shown in Figure 29 below.

Figure 29: Final classification tree after pruning
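The cp tuning above is a form of cost-complexity pruning; in scikit-learn the analogous knob is ccp_alpha, and the pruning path can be searched by cross-validation, as in this sketch (the seed and the thinning of the grid are assumptions).

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Candidate pruning strengths from the cost-complexity path (analogue of cp)
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(
    X_train, y_train)
alphas = np.unique(path.ccp_alphas)[::5]   # thin the grid to keep the search fast

# Pick the alpha with the best 5-fold cross-validated accuracy
scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=1),
                          X_train, y_train, cv=5).mean() for a in alphas]
best_alpha = alphas[int(np.argmax(scores))]
tree = DecisionTreeClassifier(ccp_alpha=best_alpha,
                              random_state=1).fit(X_train, y_train)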
The ROC curves for the in-sample and out-of-sample predictions of the classification tree model are shown in Figures 30 and 31, respectively.

Figure 30: ROC curve for in-sample classification tree predictions
Figure 31: ROC curve for out-of-sample classification tree predictions

The in-sample and out-of-sample performance of the classification tree model is given in Table 5 below.

Table 5: Classification tree model in- and out-of-sample performance metrics
Sample | AUC | False Positive Rate | False Negative Rate | Misclassification Rate
In-sample | 0.7284 | 0.2693 | 0.3288 | 0.2824
Out-of-sample | 0.7304 | 0.2713 | 0.3378 | 0.2862

The classification tree model predicts defaults with a 0.2824 error rate on the training dataset and a 0.2862 error rate on the test dataset, with an in-sample AUC of 0.7284 and an out-of-sample AUC of around 0.7304. It can be concluded that the classification tree fits the data reasonably well and shows considerable predictive power.

5.2.2. Adaptive Boosting (AdaBoost)

Boosting is a method that makes maximum use of a classifier by improving its accuracy. The classifier method is used as a subroutine to build an extremely accurate classifier on the training set. Boosting applies the classification system repeatedly to the training data, but at each step the learning attention is focused on different examples of the set using adaptive weights. Once the process has finished, the single classifiers obtained are combined into a final, highly accurate classifier on the training set, which therefore usually achieves a high degree of accuracy on the test set as well, as various authors have shown both theoretically and empirically. Of the several versions of boosting algorithms, the best known for binary classification problems is AdaBoost.
It is worth highlighting that the boosting function allows the relative importance of the predictor variables to be quantified. A small individual tree can be easy to understand, but it is much harder to interpret the hundreds or thousands of trees used in the boosting ensemble, so being able to quantify the contribution of the predictor variables to the discrimination is a really important advantage. The measure of importance takes into account the gain in the Gini index given by a variable in a tree and the weight of that tree in the boosting ensemble.

The AdaBoost technique was applied to the dataset in this project, and after one hundred iterations with adaptive weights it produced the importance of each variable in determining the binary output. The result is shown in Figure 32 below.

Figure 32: Relative importance of each variable in the classification task

The boosting algorithm gave the maximum importance to variable X6, the repayment status in March 2016. This is in concordance with the earlier single-tree structure and also makes sense, as the status of a credit card holder next month should depend heavily on the previous month's repayment status. However, not all of the AdaBoost results could be explained theoretically. For the final AdaBoost model, predictions were made for the training and testing samples, and the results are given in Table 6 below.

Table 6: AdaBoost classification tree model in- and out-of-sample performance metrics
Sample | False Positive Rate | False Negative Rate | Misclassification Rate
In-sample | 0.0842 | 0.4342 | 0.1893
Out-of-sample | 0.1607 | 0.2891 | 0.1678
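A rough scikit-learn counterpart of the boosting fit described above is AdaBoostClassifier, whose feature_importances_ plays the role of Figure 32; the base learner and learning rate below are library defaults, not taken from the report.

from sklearn.ensemble import AdaBoostClassifier
import pandas as pd

# One hundred boosting iterations, as described above
ada = AdaBoostClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)
print(classification_metrics(ada.predict_proba(X_test)[:, 1], y_test))

# Relative importance of each variable (cf. Figure 32)
importance = pd.Series(ada.feature_importances_, index=X_train.columns)
print(importance.sort_values(ascending=False).head(10))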
Model performance clearly improved over the ordinary classification tree when the AdaBoost technique was applied to the same dataset. Because of the adaptive weighting this method uses, its performance on the testing dataset improves, and concordant effects are observed in the results of Table 6. Even though the false positive rate increased on the testing dataset, the more important metric, the false negative rate, decreased significantly, leading to an overall decrease in the error rate.

5.3. Artificial Neural Network

Artificial neural networks (ANNs) use non-linear mathematical equations to successively develop meaningful relationships between input and output variables through a learning process. We applied back-propagation networks to classify the data. A back-propagation neural network uses a feed-forward topology and supervised learning; its structure typically consists of an input layer, one or more hidden layers, and an output layer, each with several neurons. ANNs can easily handle non-linear and interactive effects of explanatory variables; their major drawback is that they cannot produce a simple probabilistic formula for classification.

An ANN black-box model was fit to the training dataset, and the model converged after 500 iterations. As this is a black-box model, its internal details cannot be shown here; the in-sample and out-of-sample performance of the ANN model is given in Table 7 below.

Table 7: ANN model in- and out-of-sample performance metrics
Sample | False Positive Rate | False Negative Rate | Misclassification Rate
In-sample | 0.2980 | 0.5191 | 0.3467
Out-of-sample | 0.3499 | 0.4402 | 0.3702

The ANN model performs poorly on this dataset, especially on the hold-out sample, with a 0.37 error rate.

5.4. Linear Discriminant Analysis

Discriminant analysis (DA), also known as Fisher's rule, is another technique applicable to a binary response variable. It is an alternative to logistic regression and is based on the assumption that, for each class of the response variable, the explanatory variables follow a multivariate normal distribution with a common variance-covariance matrix. The objective of Fisher's rule is to maximize the distance between different groups and to minimize the distance within each group. The pros and cons of DA are similar to those of logistic regression. Assuming the underlying explanatory variables are normally distributed, a discriminant analysis model was applied to the training dataset. To check in-sample and out-of-sample performance, the response variable was predicted using the cut-off probability of 0.2, the traditional value used by the company for default predictions. The ROC curves for the in-sample and out-of-sample LDA predictions are shown in Figures 33 and 34, respectively.
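Before turning to the results, note that neither the network nor the discriminant fit is shown in code in the report; minimal scikit-learn sketches follow, with the single hidden layer of 10 neurons an assumption (the report states only the 500-iteration budget).

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Back-propagation network: feed-forward, one hidden layer, 500 iterations
ann = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(10,), max_iter=500,
                                  random_state=1))
ann.fit(X_train, y_train)
print(classification_metrics(ann.predict_proba(X_test)[:, 1], y_test))  # cf. Table 7

# Linear discriminant analysis, scored with the same 0.2 cut-off
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
print(classification_metrics(lda.predict_proba(X_test)[:, 1], y_test))  # cf. Table 8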
Figure 33: ROC curve for in-sample LDA model predictions
Figure 34: ROC curve for out-of-sample LDA model predictions

The in-sample and out-of-sample performance of the LDA model is given in Table 8 below.

Table 8: LDA model in- and out-of-sample performance metrics
Sample | AUC | False Positive Rate | False Negative Rate | Misclassification Rate
In-sample | 0.7723 | 0.1383 | 0.4549 | 0.2080
Out-of-sample | 0.7697 | 0.1326 | 0.4461 | 0.2030

The linear discriminant analysis model predicts defaults with a 0.2080 error rate on the training dataset and a 0.2030 error rate on the test dataset, and the AUC for both the in-sample and out-of-sample ROC curves is around 0.77. It can be concluded that LDA fits the data well and shows considerable predictive power.

6. MODEL COMPARISON

Various models were built on the training dataset, and their performance in predicting defaults was checked on both the training and the testing datasets. For defaults, false negatives hurt the business more than false positives, so fewer false negatives are desired in the predictions. To compare the performance of all the models, a cost function was introduced with five times as much penalty for false negatives as for false positives. The comparison and model performance summary is given in Table 9 below.
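The exact formula behind the Cost column is not stated in the report; the values in Table 9 are consistent with an average per-observation cost of (5·FN + FP)/N, which could be computed as in this hypothetical sketch.

import numpy as np

def weighted_cost(probs, y_true, cutoff=0.2, fn_weight=5):
    # Assumed cost definition (not stated explicitly in the report): average
    # penalty per observation, with false negatives weighted 5x false positives.
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(probs) >= cutoff).astype(int)
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    return (fn_weight * fn + fp) / len(y_true)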
Table 9: Comparison of in- and out-of-sample metrics of all models
Model | Sample | AUC | FP | FN | Error Rate | Cost
1. Logistic Regression | In-sample | 0.7719 | 0.2038 | 0.3849 | 0.2437 | 0.58308
1. Logistic Regression | Out-of-sample | 0.7717 | 0.2016 | 0.3838 | 0.2425 | 0.58726
1.1 Logistic after PCA | In-sample | NA | 0.1341 | 0.2862 | 0.1589 | 0.4164
1.1 Logistic after PCA | Out-of-sample | NA | 0.1653 | 0.2948 | 0.1732 | 0.4327
2. Classification Tree | In-sample | 0.7284 | 0.2693 | 0.3288 | 0.2824 | 0.57225
2. Classification Tree | Out-of-sample | 0.7304 | 0.2713 | 0.3378 | 0.2862 | 0.58959
2.1 AdaBoost Classifier | In-sample | NA | 0.0842 | 0.4342 | 0.1893 | 0.5196
2.1 AdaBoost Classifier | Out-of-sample | NA | 0.1607 | 0.2891 | 0.1678 | 0.4273
3. ANN | In-sample | NA | 0.2980 | 0.5191 | 0.3467 | 0.80442
3. ANN | Out-of-sample | NA | 0.3499 | 0.4402 | 0.3702 | 0.76563
4. LDA | In-sample | 0.7723 | 0.1383 | 0.4549 | 0.2080 | 0.66475
4. LDA | Out-of-sample | 0.7697 | 0.1326 | 0.4461 | 0.2030 | 0.60377

It can be observed that although some methods (e.g. LDA) have a lower misclassification rate, their high false negative rate, combined with the asymmetric cost function of the business problem, makes the overall cost high. The ANN model performs poorly on both the in-sample and out-of-sample datasets, with a very high cost value. The logistic model and the classification tree have almost the same cost; however, after applying PCA and running logistic regression on the reduced-dimension data, the results are much better than for ordinary logistic regression. Similarly, the AdaBoost method improves the performance of the classification tree through its adaptive boosting technique. The error rate of logistic regression with PCA is comparable to that of AdaBoost, but its false negative rate on the training dataset is lower, making it the better model. Depending on the business requirement, either the logistic model (after PCA) or the AdaBoost classifier could be used to predict the defaulting credit card holders for the next month.

7. CONCLUSION

This exercise has enabled us to predict the customers that are likely to default next month using the past six months of data for each credit card holder, namely the payment and default history. Various classifiers were built for the problem statement and compared on the basis of their false positive and false negative rates. It was observed that the logistic regression model (after PCA) and the AdaBoost classifier perform best among all the models and hence could be accepted. The model might perform even better if a few more variables, for which data is not readily available, were incorporated; for example, credit score, income, etc. would train the model better than the current data. That said, considering only the past six months of data for predicting defaults in the immediate future is not the best way to solve the problem, and many of the data mining and machine learning techniques now booming in the financial industry could further improve the prediction of credit card customer defaults.
8. REFERENCES –

1. http://www.goodfinancialcents.com/credit-card-default-debt-consequences-results/
2. http://www.bestcreditcardrates.com
3. http://www.philly.com/philly/blogs/inq-phillydeals/US-credit-card-borrowing-surges-more-defaults-soon.html
4. https://www.federalreserve.gov/econresdata/
5. Banfield RE, Hall LO, Bowyer KW, Kegelmeyer WP (2007). "A Comparison of Decision Tree Ensemble Creation Techniques." IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1), 173-180.
6. https://www.jstatsoft.org/index
7. http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/