This is a project I worked on as the capstone for my Master's in Business Analytics program at the University of Cincinnati. In this project, I performed an end-to-end data mining exercise, including data cleaning, distribution analysis, exploratory data analysis, and model building, to identify and predict credit card defaults using customers' past payment data and general profiles. In building the machine learning models, I fit and compared the performance of multiple models and algorithms, including Logistic Regression, PCA, Classification Tree, AdaBoost Classifier, ANN, and LDA.
Predicting Credit Card Defaults using Machine Learning Algorithms
1. 1
MS-CAPSTONE
(BANA 6064)
CARL H. LINDNER COLLEGE OF BUSINESS
SUMMER 2016
PREDICTING CREDIT CARD DEFAULTS
Understanding the concept of default, why it happens and the components used to predict the
default of credit card holders
Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in
Business Analytics
TO:
Prof. Yichen Qin (1st Reader)
Prof. Peng Wang (2nd Reader)
BY:
Sagar Vinaykumar Tupkar
tupkarsr@mail.uc.edu
M08773948
2. 2
ABSTRACT
Credit card defaults pose a major problem to all the major financial service providers today, as they have to invest heavily in collection strategies whose outcomes are uncertain. Analysts in the financial industry have achieved great success in devising methods to predict the default of credit card holders based on various factors. This study uses the previous 6 months of a customer's data to predict whether the customer will default in the next month, applying various statistical and data mining techniques and building different models for the task. The exploratory data analysis is also important for checking the distributions and patterns in customer behavior that eventually lead to default. Out of the four models built, Logistic Regression after Principal Component Analysis and the Adaptive Boosting Classifier performed best, predicting defaults with around 83% accuracy while minimizing the penalty to the company. This study also produced a list of important variables that affect the model and should be considered when predicting defaults. Even though the accuracy of the predictions is good, further research and more powerful techniques can potentially enhance the results and bring a revolution to the credit card industry.
3. 3
Contents
ABSTRACT – 2
1. INTRODUCTION – 4
1.1. Credit-Card Default Definition – 4
1.2. Background and Current Situation of Credit Card Defaults – 4
2. OBJECTIVE OF THE STUDY – 4
3. DATA – 5
4. EXPLORATORY DATA ANALYSIS – 9
4.1. Gender based Distribution – 10
4.2. Education based Distribution – 10
4.3. Age based distribution – 11
4.4. Marital Status based Distribution – 11
4.5. Credit-Line based distribution – 12
4.6. Distribution of Payment statistics in October 2015 – 13
4.7. Distribution of Payment statistics in November 2015 – 13
4.8. Distribution of Payment statistics in December 2015 – 14
4.9. Distribution of Payment statistics in January 2016 – 15
4.10. Distribution of Payment statistics in February 2016 – 15
4.11. Distribution of Payment statistics in March 2016 – 16
5. MODEL PREPARATION – 17
5.1. Logistic Regression Model – 17
5.2. Classification Tree – 22
5.3. Artificial Neural Network – 26
5.4. Linear Discriminant Analysis – 26
6. MODEL COMPARISON – 27
7. CONCLUSION – 28
8. REFERENCES – 29
4. 4
1. INTRODUCTION
1.1. Credit-Card Default Definition –
When a customer applies for and receives a credit card, it becomes a major responsibility for the customer as well as for the credit card issuing company. The credit card company evaluates the customer's creditworthiness and grants a line of credit that it believes the customer can handle responsibly. While most people use their card to make purchases and then diligently pay off what they charge, some people, for one reason or another, do not keep up with their payments and eventually go into credit card default.

Credit card default is the term for what happens when a credit card user charges purchases to the card and then does not pay the bill. It can occur when one payment is more than 30 days past due, which may raise the cardholder's interest rate. Most of the time, the term default is used informally when the credit card payment is more than 60 days past due. A default has a negative impact on the credit report and will most likely lead to higher interest rates on future borrowing.
1.2. Background and Current Situation of Credit Card Defaults –
The U.S. economy is growing at just 2.5% a year, but credit card lending is rising more than twice as fast: 5% over year-earlier levels each month since last fall, accelerating to 6% in March and April 2016, according to Federal Reserve data. That is the fastest card debt has grown since card lending fell in the 2009 recession, and since Americans aren't earning that much more, won't delinquencies, charge-offs and bankruptcies be rising in another year or two?

As a matter of fact, for the 9 dominant U.S. credit card banks, which control 70% of the Visa-MasterCard-American Express-Discover-Chinapay market in the U.S., average charge-offs in early 2016 were 3.13% of annualized average loans, "down from a peak of 9.9% in 2009."

In recent years, credit card issuers have been facing a cash and credit card debt crisis, having over-issued cash and credit cards to unqualified applicants in order to increase their market share. At the same time, many cardholders, irrespective of their repayment ability, have overused credit cards for consumption and accumulated heavy credit and cash-card debts. The crisis is an omen for a blow to consumer finance confidence, and it is a big challenge for both banks and cardholders.
2. OBJECTIVE OF THE STUDY –
This project is an attempt to identify credit card customers who are more likely to default in the coming month. Many credit card issuing companies are working on predictive models that would help them anticipate the payment status of a customer ahead of time using the customer's credit score, credit history, payment history and other factors.
5. 5
Plenty of statistical models to predict delinquency already exist in the financial industry today; however, as the famous quote goes, "All models are wrong, but some are useful," so our attempt to build another predictive model makes its own small contribution to the research.

This project is aimed at using a customer's personal and financial information, such as credit line, age, and repayment and delinquency history for the past 6 months, to predict the probability that the particular customer will default next month. Several statistical and data mining techniques are used to build a binary predictive model.

If credit card issuing companies can effectively predict the imminent default of customers beforehand, it will help them pursue the targeted customers and take calculated steps to avoid the default and to limit future losses efficiently.
3. DATA
As mentioned earlier, this project uses a customer's personal and financial information, such as credit line, age, and repayment and delinquency history for the past 6 months, to predict the probability that the particular customer will default next month. The data was provided by one of the well-known credit card issuing banks of the USA and contains proprietary information about the customers (e.g. account numbers, which have been masked). The data in no way directly reveals the identity of any individual or provides information that could be decrypted to connect to an individual.

In this project, the plan is to predict the probability of credit card holders defaulting in the next month using payment data from October 2015 to March 2016. Among the total 30,000 observations, 6,636 (22.12%) are cardholders with a default payment. To model the binary response variable, default payment in April 2016 (Yes = 1, No = 0), the following 23 variables are used as explanatory variables:
1. X1: Amount of the given credit (USD): it includes both the individual consumer credit and his/her
family (supplementary) credit.
Including this variable in the study is important as the credit line of a customer is a good indicator
of the financial credit score of the customer. Using this variable will help the model predict
defaults more effectively.
2. X2: Gender (1 = male; 2 = female)
It might be useful to see whether the gender of the customer is in any way related to his/her
probability of default. The distribution of defaults based on gender will be an interesting chart to
look at.
3. X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others)
It might be useful to see whether the education level of the customer is in any way related to
his/her probability of default. The distribution of defaults based on education level will be an
interesting chart to look at.
6. 6
4. X4: Marital status (1 = married; 2 = single; 3 = others)
It might be useful to see whether the marital status of the customer is in any way related to his/her
probability of default. The distribution of defaults based on marital status will be an interesting
chart to look at.
5. X5: Age (year)
It might be useful to see whether the age of the customer is in any way related to his/her
probability of default. The distribution of defaults based on age buckets will be an interesting
chart to look at.
6. X6 - X11: History of past payment. Customers’ past monthly payment records (from October 2015
to March, 2016) were tracked and used in the dataset as follows:
X6 = the repayment status in March, 2016;
X7 = the repayment status in February, 2016;
. . .;
X11 = the repayment status in October, 2015.
The measurement scale for the repayment status is:
-2 = Minimum due payment scheduled for 60 days
-1 = Minimum due payment scheduled for 30 days
0 = pay duly;
1 = payment delay for one month;
2 = payment delay for two months;
. . .;
8 = payment delay for eight months and above;
This information is crucial as it directly provides the payment status of the customer for the past 6 months. Using these variables will help train the model to predict defaults efficiently.
7. X12-X17: Amount of bill statement (USD)
X12 = amount of bill statement in March, 2016;
X13 = amount of bill statement in February, 2016;
. . .;
X17 = amount of bill statement in October, 2015.
7. 7
Actual bill statements of the customers for the past 6 months would give a quantitative estimate
for the amount spent by the customer using the credit card.
8. X18-X23: Amount of previous payment (USD)
X18 = amount paid in March, 2016;
X19 = amount paid in February, 2016;
. . .;
X23 = amount paid in October, 2015.
The amounts (in USD) paid by the customer in the past 6 months indicate the repayment ability of the customer, and the payment pattern can be used to train the model efficiently.
The variable names, examples, data types and descriptions are provided below in Figure 1.
Column name | Variable Name | Example | Data Type | Description
ID | ID | 23 | Integer | Masked account number of the customer
Y | default payment next month | 1 | Binary | 1 = customer defaults next month, 0 = otherwise
X1 | LIMIT_BAL | 2170 | Continuous numeric | Credit line of the customer
X2 | SEX | 2 | Factor | Gender of the customer (1 = male; 2 = female)
X3 | EDUCATION | 2 | Factor | Education (1 = graduate school; 2 = university; 3 = high school; 4 = others)
X4 | MARRIAGE | 2 | Factor | Marital status (1 = married; 2 = single; 3 = others)
X5 | AGE | 26 | Integer | Age (years)
X6 | PAY_1 | 2 | Factor | Repayment status in March 2016
X7 | PAY_2 | 0 | Factor | Repayment status in February 2016
X8 | PAY_3 | 0 | Factor | Repayment status in January 2016
X9 | PAY_4 | 2 | Factor | Repayment status in December 2015
X10 | PAY_5 | 2 | Factor | Repayment status in November 2015
X11 | PAY_6 | 2 | Factor | Repayment status in October 2015
X12 | BILL_AMT1 | 1273.70 | Continuous numeric | Amount of bill statement in March 2016 (USD)
X13 | BILL_AMT2 | 1315.80 | Continuous numeric | Amount of bill statement in February 2016 (USD)
X14 | BILL_AMT3 | 1395.62 | Continuous numeric | Amount of bill statement in January 2016 (USD)
X15 | BILL_AMT4 | 1364.19 | Continuous numeric | Amount of bill statement in December 2015 (USD)
X16 | BILL_AMT5 | 1454.06 | Continuous numeric | Amount of bill statement in November 2015 (USD)
X17 | BILL_AMT6 | 1426.37 | Continuous numeric | Amount of bill statement in October 2015 (USD)
X18 | PAY_AMT1 | 62.22 | Continuous numeric | Amount paid in March 2016 (USD)
X19 | PAY_AMT2 | 111.04 | Continuous numeric | Amount paid in February 2016 (USD)
X20 | PAY_AMT3 | 0.00 | Continuous numeric | Amount paid in January 2016 (USD)
X21 | PAY_AMT4 | 111.63 | Continuous numeric | Amount paid in December 2015 (USD)
X22 | PAY_AMT5 | 0.00 | Continuous numeric | Amount paid in November 2015 (USD)
X23 | PAY_AMT6 | 56.42 | Continuous numeric | Amount paid in October 2015 (USD)
Figure 1: Data Dictionary
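As a minimal sketch of how a dataset with this dictionary might be loaded and typed, the snippet below uses Python with pandas. The file name and exact column labels are assumptions for illustration; the bank's data is proprietary and the report does not state the tooling used.

```python
import pandas as pd

# Hypothetical file name; the proprietary data is not distributed with the report.
df = pd.read_csv("credit_default.csv")

# Cast the nominal variables to categorical types, per the data dictionary.
factor_cols = ["SEX", "EDUCATION", "MARRIAGE",
               "PAY_1", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]
df[factor_cols] = df[factor_cols].astype("category")

# Response: 1 = default on the April 2016 payment, 0 = no default.
y = df["default payment next month"]
X = df.drop(columns=["ID", "default payment next month"])
```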
8. 8
To give a feel for the data, Figure 2 shows a snapshot of a subset of observations in the dataset, Figure 3 provides information such as the data type and levels of each variable, and Figure 4 provides a statistical summary of the data –
Figure 3: Description of variables in the dataset
Figure 2: Snapshot of the top 15 observations of the dataset
9. 9
Figure 4: Summary of the variables in the dataset
4. EXPLORATORY DATA ANALYSIS
From the description of the data above, it can be concluded that the data does not have any null values for any of the variables. We start with an initial exploratory data analysis, looking at the distribution of different variables for Y=0 and Y=1, so that the behavior of default and non-default customers can be analyzed.

The values of each variable were aggregated and the totals were plotted against the number of customers; frequency charts were thus prepared for the class variables and insights were drawn from the visualizations. The results are presented separately for Y=1 and Y=0, i.e. default and non-default customers (red being default and green being non-default), so that the analysis becomes easier.
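As a minimal illustration of this kind of frequency chart, the sketch below tabulates default counts within one class variable and draws a stacked bar chart (red for default, green for non-default). It reuses the `df` frame from the loading sketch above, and matplotlib is an assumption rather than the report's actual tooling.

```python
import matplotlib.pyplot as plt

# Default rate within each gender group (1 = male, 2 = female).
print(df.groupby("SEX", observed=True)["default payment next month"].mean())

# Stacked frequency chart: non-default (green) vs default (red), as in the figures.
counts = pd.crosstab(df["SEX"], df["default payment next month"])
counts.plot(kind="bar", stacked=True, color=["green", "red"])
plt.ylabel("Number of customers")
plt.show()
```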
10. 10
4.1. Gender based Distribution:
The bar chart was plotted for distribution of customers based on their gender and is shown in Figure 5.
The result is separated by default and non-default customers (red being default and green being non-
default).
It can be observed that of the ~12k male credit card holders, 24.17% defaulted, whereas of the ~18k female credit card holders, 20.78% defaulted. Although there are more female customers than male customers in total, the percentage of defaulters is higher among male customers than among female customers.
4.2. Education based Distribution:
The bar chart was plotted for distribution of customers based on their education and is shown in Figure
6. The result is separated by default and non-default customers (red being default and green being non-
default).
Figure 6: Distribution by Education
Figure 5: Distribution by Gender
11. 11
It can be observed that most of the credit card holders (~14k) are university educated, followed by graduate school and high school educated customers. Although the default percentage does not vary dramatically with education, it is worth noticing that 25.16% of the high school educated customers defaulted, while the rate falls to 23.73% for university and 19.23% for graduate school customers.
4.3. Age based distribution:
The customers were categorized into bins and bar chart plotted for distribution of customers based on
their age is shown in Figure 7. The result is separated by default and non-default customers (red being
default and green being non-default).
Figure 7: Distribution by Age
It can be observed that credit card holders aged less than 25 years have the highest default proportion (~27.20%), followed by the 45-60 age group (25.13%). While customers in the 25-35 age group are the most numerous, their default proportion (~20.30%) is comparatively low.
4.4. Marital Status based Distribution:
The bar chart was plotted for distribution of customers based on their marital status and is shown in Figure
8. The result is separated by default and non-default customers (red being default and green being non-
default).
12. 12
Figure 8: Distribution by Marital Status
It can be observed that married credit card holders have a higher default proportion (~23.47%) than single customers (~20.93%). Although married customers are fewer in number than single customers, the default proportion is higher among married credit card holders.
4.5. Credit-Line based distribution:
The customers were categorized into bins and bar chart plotted for distribution of customers based on
their credit line is shown in Figure 9. The result is separated by default and non-default customers (red
being default and green being non-default).
Figure 9: Distribution by Credit Line
13. 13
It can be observed that customers whose credit line is between $1000-$5000 are the most numerous and have the second-highest default proportion (24.46%), after customers with a credit line between $500-$1000, whose default proportion is alarmingly high (35.30%).
4.6. Distribution of Payment statistics in October 2015
For the month of October 2015, the credit card holders were distributed according to the repayment status codes described above, as shown in Figure 10. Also, the total bill statement amount and the amount paid by the credit card holders were split between default and non-default customers to analyze the two groups separately, as shown below in Figure 11.
Figure 10: Repayment status distribution in Oct'15 Figure 11:Bill and Paid amount distribution in Oct'15
It can be observed that in October 2015, while most of the customers paid duly, almost 50% of the customers who had a payment delay of 2 months went on to default in April 2016. On the other hand, the customers who defaulted in April 2016 paid only ~9% of their bill statement in October 2015, as opposed to ~14% paid by non-default customers.
4.7. Distribution of Payment statistics in November 2015
For the month of November 2015, the credit card holders were distributed according to the repayment status codes described above, as shown in Figure 12. Also, the total bill statement amount and the amount paid by the credit card holders were split between default and non-default customers to analyze the two groups separately, as shown below in Figure 13.
14. 14
Figure 12: Repayment status distribution in Nov'15 Figure 13: Bill and Paid amount distribution in Nov'15
It can be observed that in November 2015, while most of the customers paid duly, almost 54% of the customers who had a payment delay of 2 months went on to default in April 2016. On the other hand, the customers who defaulted in April 2016 paid only ~8% of their bill statement in November 2015, as opposed to ~13% paid by non-default customers.
4.8. Distribution of Payment statistics in December 2015
For the month of December 2015, the credit card holders were distributed according to the repayment status codes described above, as shown in Figure 14. Also, the total bill statement amount and the amount paid by the credit card holders were split between default and non-default customers to analyze the two groups separately, as shown below in Figure 15.
Figure 14: Repayment status distribution in Dec'15 Figure 15: Bill and Paid amount distribution in Dec'15
15. 15
It can be observed that in December 2015, while most of the customers paid duly, almost 52% of the customers who had a payment delay of 2 months went on to default in April 2016. On the other hand, the customers who defaulted in April 2016 paid only ~7.5% of their bill statement in December 2015, as opposed to ~12% paid by non-default customers.
4.9. Distribution of Payment statistics in January 2016
For the month of January 2016, the credit card holders were distributed according to the repayment status codes described above, as shown in Figure 16. Also, the total bill statement amount and the amount paid by the credit card holders were split between default and non-default customers to analyze the two groups separately, as shown below in Figure 17.
Figure 16: Repayment status distribution in Jan'16 Figure 17: Bill and Paid amount distribution in Jan'16
It can be observed that in January 2016, while most of the customers paid duly, almost 52% of the customers who had a payment delay of 2 months went on to default in April 2016. On the other hand, the customers who defaulted in April 2016 paid only ~7.5% of their bill statement in January 2016, as opposed to ~12% paid by non-default customers.
4.10. Distribution of Payment statistics in February 2016
For the month of February 2016, the credit card holders were distributed according to the repayment status codes described above, as shown in Figure 18. Also, the total bill statement amount and the amount paid by the credit card holders were split between default and non-default customers to analyze the two groups separately, as shown below in Figure 19.
16. 16
Figure 18: Repayment status distribution in Feb'16 Figure 19: Bill and Paid amount distribution in Feb'16
It can be observed that in February 2016, while most of the customers paid duly, almost 56% of the customers who had a payment delay of 2 months went on to default in April 2016. On the other hand, the customers who defaulted in April 2016 paid only ~7% of their bill statement in February 2016, as opposed to ~13% paid by non-default customers.
4.11. Distribution of Payment statistics in March 2016
For the month of March 2016, the credit card holders were distributed according to the repayment status codes described above, as shown in Figure 20. Also, the total bill statement amount and the amount paid by the credit card holders were split between default and non-default customers to analyze the two groups separately, as shown below in Figure 21.
Figure 20: Repayment status distribution in Mar'16 Figure 21: Bill and Paid amount distribution in Mar'16
17. 17
It can be observed that in March 2016, while most of the customers paid duly, almost 70% of the customers who had a payment delay of 2 months, and 34% of the customers who had a payment delay of 1 month, went on to default in April 2016. On the other hand, the customers who defaulted in April 2016 paid only ~7% of their bill statement in March 2016, as opposed to ~12% paid by non-default customers.
5. MODEL PREPARATION
The aim of this exercise is to build a model, using the variables explained in the earlier section, to predict which credit card holders will default next month. The data used to train the model is the past 6 months of financial, delinquency and payment history. To build the models, the data was divided into training (80%) and testing (20%) subsets. Multiple classifiers were trained on the training dataset, which contained 24,000 observations, and compared on various model performance metrics. We will go through each model separately and discuss the scope, performance and pros and cons of every classifier method.
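A minimal sketch of such a split, continuing the earlier snippets. Stratifying on the response (so both subsets keep the ~22% default rate) is an assumption, as the report does not say how the split was drawn.

```python
from sklearn.model_selection import train_test_split

# 80/20 split: 24,000 training and 6,000 test observations out of 30,000.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```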
5.1. Logistic Regression Model
Logistic regression can be considered a generalization of the linear regression model to a binary response; the binary response variable violates the normality assumptions of ordinary regression models. A logistic regression model specifies that an appropriate function of the fitted probability of the event is a linear function of the observed values of the available explanatory variables. The major advantage of this approach is that it can produce a simple probabilistic formula for classification. The weakness is that logistic regression cannot properly deal with non-linear and interaction effects of the explanatory variables.

A logistic regression model was fit on the training dataset using all the variables, and a summary of the model is shown in Table 1 below –
Table 1: Logistic full model summary
Logistic Regression | AIC | Null Deviance | Residual Deviance
Model Summary | 20979 | 25314 | 20815
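A binomial GLM reports exactly these quantities (AIC, null deviance, residual deviance). The sketch below, illustrative rather than the report's actual code, fits the full model with statsmodels after dummy-coding the nominal variables.

```python
import statsmodels.api as sm

# Dummy-code the factor variables, then fit a logistic (binomial) GLM.
X_train_d = pd.get_dummies(X_train, drop_first=True).astype(float)
logit_full = sm.GLM(y_train, sm.add_constant(X_train_d),
                    family=sm.families.Binomial()).fit()
print(logit_full.aic, logit_full.null_deviance, logit_full.deviance)
```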
It was observed that some of the dummy variables created from the class/nominal variables in the dataset were not significant in the full logistic regression model above, and hence a stepwise variable selection method was performed.
5.1.1. Stepwise Variable Selection Method
By performing stepwise variable selection, it was observed that some of the variables were omitted
because they were insignificant in the full model. Finally, the new model was –
Y ~ X1 + X2 + X3 + X4 + X6 + X7 + X8 + X9 + X10 + X11 + X12 + X13 + X18 + X19 + X20 + X22
18. 18
It can be seen that the variables X5, X14, X15, X16, X17, X21 and X23 were omitted from the model. The
summary of the new model can be seen in Table 2 below –
Table 2: Logistic stepwise model summary
Logistic Stepwise Regression AIC Null Deviance Residual Deviance
Model Summary 20970 25314 20820
The omitted variables, viz. age and the bill statement and payment amounts of a few months, are important from a business knowledge perspective, and it is highly recommended to keep them in the model that is used. Moreover, even after removing these variables, the AIC of the model has not decreased significantly.
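For reference, a bare-bones backward stepwise search by AIC might look like the sketch below. This is an illustrative reimplementation (the report does not name its stepwise routine), reusing the statsmodels setup from the previous sketch.

```python
# Greedy backward elimination: drop the variable whose removal lowers AIC most,
# and stop when no single removal improves AIC. Refits many GLMs, so it is slow.
def backward_step_aic(y, X):
    def fit_aic(cols):
        return sm.GLM(y, sm.add_constant(X[cols]),
                      family=sm.families.Binomial()).fit().aic
    cols = list(X.columns)
    best = fit_aic(cols)
    while len(cols) > 1:
        trials = {c: fit_aic([k for k in cols if k != c]) for c in cols}
        drop, aic = min(trials.items(), key=lambda kv: kv[1])
        if aic >= best:
            break
        cols.remove(drop)
        best = aic
    return cols

selected = backward_step_aic(y_train, X_train_d)
```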
5.1.2. LASSO variable selection –
In statistics and machine learning, lasso (least absolute shrinkage and selection operator; also written Lasso or LASSO) is a regression analysis method that performs both variable selection and regularization in order to avoid overfitting and to enhance the prediction accuracy and interpretability of the statistical model it produces. In this project, 5-fold cross-validation was used to select the tuning parameter (lambda) that enters the LASSO optimization problem.

The entire dataset was used for the 5-fold cross-validation and LASSO variable selection, and the binomial deviance was plotted against different values of the tuning parameter lambda. The optimal value of lambda is indicated by the vertical line in the plot shown below and is around 0.004.
Figure 22: Binomial Deviance plot to choose the tuning parameter-lambda
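As an aside, a comparable selection can be sketched with scikit-learn's cross-validated L1-penalised logistic regression. This tooling is an assumption; note also that scikit-learn parameterises the penalty by C, which corresponds inversely to lambda (roughly lambda is 1/(C * n_samples)).

```python
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# 5-fold CV over a grid of penalty strengths, with an L1 penalty for sparsity.
X_std = StandardScaler().fit_transform(X_train_d)
lasso_cv = LogisticRegressionCV(Cs=20, cv=5, penalty="l1",
                                solver="saga", max_iter=5000)
lasso_cv.fit(X_std, y_train)

# Variables whose coefficients survive the penalty are the "selected" ones.
kept = X_train_d.columns[(lasso_cv.coef_ != 0).ravel()]
```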
Using this lambda, variable selection was then performed with the LASSO method. The null deviance of the resulting model, however, was 31704.85; because of this higher value compared to the full logistic and stepwise models, this model could not be used.
19. 19
After comparing the three different versions of the logistic model, it was concluded that the full logistic model has better parameter values than the other two. Hence, the full logistic regression model, rather than the reduced model or the LASSO fit, is used for further analysis.

To check the model's in-sample and out-sample performance, the response variable was predicted using a cut-off probability of 0.2, the traditional value used by the company for default predictions. The ROC curves for the in-sample and out-sample predictions of the full logistic model are shown in Figure 23 and Figure 24 respectively.
Figure 23: ROC Curve for in-sample Logistic model predictions Figure 24: ROC Curve for out-sample Logistic model predictions
The in-sample and out-sample performance of the full logistic model is given in Table 3 below –
Table 3: Logistic full model in and out sample performance metrics
Model performance metric | AUC | False Positive Rate | False Negative Rate | Misclassification Rate
In-sample | 0.7719 | 0.2038 | 0.3849 | 0.2437
Out-sample | 0.7717 | 0.2016 | 0.3838 | 0.2425
The full logistic regression model predicts defaults with a 0.2437 error rate on the training dataset
and a 0.2425 error rate on the test dataset. The AUC for both the in-sample and out-sample ROC
curves is around 0.77. It can be concluded that logistic regression fits the data well and shows
considerable prediction power.
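For reference, the metrics in Table 3 can be reproduced from any fitted probability model at the 0.2 cut-off; a minimal sketch, assuming a fitted classifier clf with predict_proba and hold-out arrays X_test, y_test:

    # Confusion-matrix metrics at the company's 0.2 cut-off (sketch;
    # assumes a fitted model `clf` and hold-out arrays X_test, y_test).
    from sklearn.metrics import confusion_matrix, roc_auc_score

    prob = clf.predict_proba(X_test)[:, 1]      # P(default next month)
    pred = (prob > 0.2).astype(int)             # 0.2 cut-off, not 0.5

    tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
    print("AUC                :", roc_auc_score(y_test, prob))
    print("False Positive Rate:", fp / (fp + tn))   # good clients flagged
    print("False Negative Rate:", fn / (fn + tp))   # defaulters missed
    print("Misclassification  :", (fp + fn) / len(y_test))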
5.1.3. Principal Component Analysis
Principal components analysis is a procedure for identifying a smaller number of uncorrelated variables,
called "principal components", from a large set of data. The goal of principal components analysis is to
explain the maximum amount of variance with the fewest number of principal components. Principal
components analysis is commonly used in the social sciences, market research, and other industries that
use large data sets.
Principal components analysis is commonly used as one step in a series of analyses and can be used to
reduce the number of variables and avoid multicollinearity. Its other main advantage is that once the
patterns in the data are found, the data can be compressed by reducing the number of dimensions
without much loss of information.
To avoid the effect of multicollinearity on the predictions, the variables were standardized and
dimensionality reduction was applied to the dataset using principal component analysis (PCA). The
variance observed in the direction of each component was plotted and is shown in Figure 25 below –
Figure 25: Variance distribution for Principal Components
The principal component analysis produced 15 principal components that could explain the data almost
as efficiently as the original variables did. However, the first few principal components account for
a major share of the total variance in the dataset. To choose the number of principal components, the
‘Elbow Method’ was used; the corresponding line plot is shown in Figure 26 below –
Figure 26: Elbow curve for PCA to decide the number of PCs
It can be observed that beyond the 3rd or 4th principal component, the additional variance explained
is not significant. Hence only 4 principal components, which together account for more than 80% of
the total variance, were kept for further analysis of the dataset.
The main purpose of applying PCA to the dataset was to reduce the effect of multicollinearity and
decrease the number of dimensions so that model performance might improve. Hence, a logistic
regression was run on the reduced-dimension dataset and predictions were made with it. It was
observed that after applying PCA, the logistic model trained more efficiently and its predictive
power also improved.
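A minimal sketch of this step, assuming arrays X_train, y_train and X_test (4 components per the elbow curve; the exact pre-processing in the report may differ):

    # Standardize -> PCA (4 components, ~80% of variance) -> logistic fit.
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression

    pca_logit = make_pipeline(
        StandardScaler(),                  # PCA is scale-sensitive
        PCA(n_components=4),               # 4 PCs chosen by the elbow method
        LogisticRegression(max_iter=1000),
    ).fit(X_train, y_train)

    prob = pca_logit.predict_proba(X_test)[:, 1]
    # cumulative variance explained by the retained components:
    # pca_logit.named_steps["pca"].explained_variance_ratio_.cumsum()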
The results of the confusion matrix are given in the Table 4 below –
Table 4: Logistic model after PCA_ in and out sample performance metrics
Sample       False Positive Rate   False Negative Rate   Misclassification Rate
In-sample    0.1341                0.2862                0.1589
Out-sample   0.1653                0.2948                0.1732
It can be concluded that PCA reduced both the dimensionality and the effect of multicollinearity on
model performance, and hence the misclassification rate is lower than that of the ordinary logistic
model with all dimensions. Although performance on the hold-out sample is not as good as on the
training sample, the difference is not significant.
5.2. Classification Tree
In a classification tree structure, each internal node denotes a test on an attribute, each branch represents
an outcome of the test, and leaf nodes represent classes. The top-most node in a tree is the root node.
CTs are applied when the response variable is qualitative or quantitative discrete. Classification trees
perform a classification of the observations on the basis of all explanatory variables and supervised by the
presence of the response variable. The segmentation process is typically carried out using only one
explanatory variable at a time. CTs are based on minimizing impurity, a measure of the variability of
the response values of the observations. CTs can yield simple classification rules and can handle the
nonlinear and interactive effects of explanatory variables. However, their sequential nature and
algorithmic complexity make them dependent on the observed data, and even a small change might alter
the structure of the tree. It is difficult to take a tree structure designed for one context and
generalize it to other contexts.
A classification tree model was fit on the training dataset and the results were analyzed. The classification
tree is shown in Figure 27 below –
Figure 27: Classification Tree model diagram
5.2.1. Complexity Parameter tuning and pruning –
The tree obtained with the default complexity parameter “cp” = 0.01 has 5 nodes, as shown in
Figure 27. However, the complexity parameter needs to be tuned according to how the error changes as
each node is added. A plot of the change in relative error is shown in Figure 28, which gives the
optimal value of “cp” and hence the size of the tree.
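The report tunes rpart's “cp” in R; an analogous sketch using scikit-learn's cost-complexity pruning, assuming arrays X_train and y_train (ccp_alpha plays the role of “cp” here):

    # Cost-complexity pruning analogous to tuning rpart's "cp" (sketch;
    # assumes arrays X_train, y_train).
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
        X_train, y_train)
    scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                              X_train, y_train, cv=5).mean()
              for a in path.ccp_alphas]
    best_alpha = path.ccp_alphas[int(np.argmax(scores))]    # analogue of "cp"
    pruned = DecisionTreeClassifier(ccp_alpha=best_alpha,
                                    random_state=0).fit(X_train, y_train)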
Figure 28: Relative error vs Complexity Parameter
As the value of “cp” increases, the complexity of the tree decreases. The relative error increases
once the size of the tree exceeds 3 (“cp” = 0.05), so it is not beneficial to grow the tree beyond
that size. The tree shown in Figure 27 was pruned using a “cp” value of 0.05, based on the
observation from Figure 28, and the final tree is shown in Figure 29 below –
Figure 29: Final Classification Tree after pruning
The ROC curves for in-sample and out-sample predictions of Classification Tree model are shown in
Figure 30 and Figure 31 respectively.
Figure 30: ROC Curve for in-sample Classification Tree predictions
Figure 31: ROC Curve for out-sample Classification Tree predictions
The in-sample and out-sample performance of the Classification Tree model is given in Table 5 below –
Table 5: Classification Tree model in and out sample performance metrics
Sample       AUC      False Positive Rate   False Negative Rate   Misclassification Rate
In-sample    0.7284   0.2693                0.3288                0.2824
Out-sample   0.7304   0.2713                0.3378                0.2862
The Classification Tree model predicts defaults with a 0.2824 error rate on the training dataset and
a 0.2862 error rate on the test dataset. The AUC is 0.7284 for the in-sample predictions and around
0.7304 for the out-sample predictions. It can be concluded that the Classification Tree fits the data
well and shows considerable prediction power.
5.2.2. Adaptive Boosting (AdaBoost)
Boosting is a method that makes maximum use of a classifier by improving its accuracy. The classifier
method is used as a subroutine to build an extremely accurate classifier in the training set. Boosting
applies the classification system repeatedly on the training data, but in each step the learning attention is
focused on different examples of this set using adaptive weights. Once the process has finished, the single
classifiers obtained are combined into a final, highly accurate classifier in the training set. The final
classifier therefore usually achieves a high degree of accuracy in the test set, as various authors have
shown both theoretically and empirically. Out of the several versions of boosting algorithms, the best
known for binary classification problems is AdaBoost.
It is worth highlighting that the boosting function allows quantifying the relative importance of the
predictor variables. Understanding a small individual tree can be easy; however, it is much harder to
interpret the hundreds or thousands of trees used in the boosting ensemble. Being able to quantify
the contribution of the predictor variables to the discrimination is therefore a really important
advantage. The measure of importance takes into account the gain in the Gini index given by a
variable in a tree and the weight of that tree in the boosting ensemble.
The AdaBoost technique was applied to the dataset in this project; after one hundred iterations with
adaptive weights, it output the importance of each variable in determining the binary outcome. The
result is shown in Figure 32 below.
Figure 32: Relative importance of each variable in the classification task
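The report appears to use an R boosting implementation (e.g. adabag); an analogous sketch with scikit-learn's AdaBoostClassifier, assuming arrays X_train and y_train, with the base-learner depth chosen only for illustration:

    # 100 rounds of AdaBoost over shallow trees, then variable importances
    # (sketch; assumes arrays X_train, y_train).
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    ada = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=2),  # illustrative depth;
        n_estimators=100,                               # "estimator" is named
        random_state=0,                                 # base_estimator in
    ).fit(X_train, y_train)                             # scikit-learn < 1.2

    names = [f"X{i+1}" for i in range(X_train.shape[1])]
    top5 = sorted(zip(names, ada.feature_importances_),
                  key=lambda t: -t[1])[:5]
    print("most important predictors:", top5)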
It can be observed that the boosting algorithm gave the maximum importance to variable X6, the
repayment status in March 2016. This is in concordance with the earlier plain tree structure and also
makes sense, as a credit card holder's status next month would depend heavily on the previous month's
repayment status. However, not all results of the AdaBoost technique could be explained theoretically.
For the final AdaBoost tree model, predictions were made for the training as well as the testing
sample, and the results are delineated in Table 6 below –
Table 6: AdaBoosting Classification tree model_ in and out sample performance metrics
Sample       False Positive Rate   False Negative Rate   Misclassification Rate
In-sample    0.0842                0.4342                0.1893
Out-sample   0.1607                0.2891                0.1678
It can be clearly observed that model performance improved considerably over the plain classification
tree when the AdaBoost technique was applied to the same dataset. As explained earlier, because of the
adaptive reweighting this method uses, its performance improves on the testing dataset, and concordant
effects are observed in the results shown in Table 6. Even though the False Positive Rate increased
for the testing dataset, the more important metric, the False Negative Rate, decreased significantly,
leading to an overall decrease in the error rate.
5.3. Artificial Neural Network
Artificial neural networks use non-linear mathematical equations to successively develop meaningful
relationships between input and output variables through a learning process. We applied back
propagation networks to classify data. A back propagation neural network uses a feed-forward topology
and supervised learning. The structure of back propagation networks is typically composed of an input
layer, one or more hidden layers, and an output layer, each consisting of several neurons. ANNs can easily
handle the non-linear and interactive effects of explanatory variables. The major drawback of ANNs is
that they do not yield a simple probabilistic classification formula.
An ANN black-box model was fit to the training dataset and converged after 500 iterations. As this is
a black-box model, its internal details cannot be shown here. However, the in-sample and out-sample
performance of the ANN model is given in Table 7 below –
Table 7: ANN model in and out sample performance metrics
Sample       False Positive Rate   False Negative Rate   Misclassification Rate
In-sample    0.2980                0.5191                0.3467
Out-sample   0.3499                0.4402                0.3702
The ANN model performs poorly on this dataset, especially on the hold-out sample, with a 0.37 error
rate.
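For reference, a back-propagation network of the kind described can be sketched with scikit-learn's MLPClassifier, assuming arrays X_train, y_train and X_test; the single hidden layer of 10 neurons is purely an assumption, as the report does not state the architecture:

    # Feed-forward network trained by back-propagation, capped at 500
    # iterations (sketch; hidden-layer size is an assumption).
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.neural_network import MLPClassifier

    ann = make_pipeline(
        StandardScaler(),                        # ANNs need scaled inputs
        MLPClassifier(hidden_layer_sizes=(10,),  # hypothetical architecture
                      max_iter=500, random_state=0),
    ).fit(X_train, y_train)
    prob = ann.predict_proba(X_test)[:, 1]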
5.4. Linear Discriminant Analysis
Discriminant analysis, also known as Fisher’s rule, is another technique applied to the binary result of
response variable. DA is an alternative to logistic regression and is based on the assumptions that, for
each given class of response variable, the explanatory variables are distributed as a multivariate normal
distribution with a common variance–covariance matrix. The objective of Fisher’s rule is to maximize the
distance between different groups and to minimize the distance within each group. The pros and cons of
DA are similar to those of LR.
Hence, assuming the underlying explanatory variables are normally distributed, a discriminant
analysis model was applied to the training dataset. To check the model's in-sample and out-sample
performance, the response variable was predicted using a cut-off probability of 0.2, the value
traditionally used by the company for default predictions. The ROC curves for in-sample and
out-sample predictions of the LDA model are shown in Figure 33 and Figure 34 respectively.
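A minimal sketch of this step, assuming arrays X_train, y_train and X_test:

    # LDA fit scored at the same 0.2 cut-off used for the logistic model.
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
    prob = lda.predict_proba(X_test)[:, 1]   # posterior P(default)
    pred = (prob > 0.2).astype(int)          # company's traditional cut-off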
Figure 33: ROC Curve for in-sample LDA model predictions
Figure 34: ROC Curve for out-sample LDA model predictions
The in-sample and out-sample performance of the LDA model is given in Table 8 below –
Table 8: LDA model in and out sample performance metrics
Sample       AUC      False Positive Rate   False Negative Rate   Misclassification Rate
In-sample    0.7723   0.1383                0.4549                0.2080
Out-sample   0.7697   0.1326                0.4461                0.2030
The linear discriminant analysis model predicts defaults with a 0.2080 error rate on the training
dataset and a 0.2030 error rate on the test dataset. The AUC for both the in-sample and out-sample
ROC curves is around 0.77. It can be concluded that LDA fits the data well and shows considerable
prediction power.
6. MODEL COMPARISON
We have built various models on the training dataset and checked their performance on both the
training and the testing dataset in predicting defaults. For defaults, False Negatives hurt the
business more than False Positives, so fewer False Negatives are desired in the predictions. To
compare the performance of all the models, a cost function was introduced with five times as much
penalty for a False Negative as for a False Positive. The comparison and model performance summary is
given in Table 9 below –
Table 9: Comparison of in and out sample metrics of all models
Model                     Sample       AUC      FP Rate   FN Rate   Error Rate   Cost
1. Logistic Regression    In-sample    0.7719   0.2038    0.3849    0.2437       0.58308
                          Out-sample   0.7717   0.2016    0.3838    0.2425       0.58726
1.1 Logistic after PCA    In-sample    NA       0.1341    0.2862    0.1589       0.4164
                          Out-sample   NA       0.1653    0.2948    0.1732       0.4327
2. Classification Tree    In-sample    0.7284   0.2693    0.3288    0.2824       0.57225
                          Out-sample   0.7304   0.2713    0.3378    0.2862       0.58959
2.1 AdaBoost Classifier   In-sample    NA       0.0842    0.4342    0.1893       0.5196
                          Out-sample   NA       0.1607    0.2891    0.1678       0.4273
3. ANN                    In-sample    NA       0.2980    0.5191    0.3467       0.80442
                          Out-sample   NA       0.3499    0.4402    0.3702       0.76563
4. LDA                    In-sample    0.7723   0.1383    0.4549    0.2080       0.66475
                          Out-sample   0.7697   0.1326    0.4461    0.2030       0.60377
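The exact formula behind the Cost column is not printed in the report; one plausible per-observation form, consistent with the five-to-one penalty described above, is cost = (FP + 5 × FN) / n, sketched below:

    # Assumed asymmetric cost: each False Negative counts five times as
    # much as a False Positive, averaged over all n observations.
    from sklearn.metrics import confusion_matrix

    def asymmetric_cost(y_true, y_pred, fn_weight=5.0):
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
        return (fp + fn_weight * fn) / len(y_true)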
It can be observed that although some methods (e.g. LDA) have a lower misclassification rate, their
high False Negative rate combined with the asymmetric cost function of the business problem makes
their overall cost high. The ANN model performs poorly on both the in-sample and out-sample datasets,
with a very high cost value. The logit model and the Classification Tree have almost the same cost
function value. However, after applying PCA to the dataset and using the reduced-dimension data for
logistic regression, the results are much better than those of the ordinary logit regression.
Similarly, AdaBoosting improves the performance of the Classification Tree through its adaptive
boosting technique. The error rate of logistic regression with PCA is comparable to that of
AdaBoosting, but its False Negative rate is lower on the training dataset, making it the better model.
According to the business requirement, either the Logistic Model (after PCA) or the AdaBoost
Classifier could be used to predict the credit card holders who will default next month.
7. CONCLUSION
This exercise has enabled us to predict which customers are likely to default next month using the
past 6 months' data for each credit card holder, viz. their payment and default history. Various
classifiers were built with the problem statement in mind, and they were compared on the basis of
their False Positive and False Negative rates. It was observed that the Logistic Regression model
(after PCA) and the AdaBoost Classifier perform best amongst all the models and hence could be
accepted.
It can also be said that our model might perform even better if we incorporated a few more variables
for which data is not readily available; for example, credit score and income would let the model
train on richer information than what we have now. Having said that, considering only the past 6
months' data for predicting defaults in the immediate future is not the best way to solve the problem;
many data mining and machine learning techniques now flourishing in the financial industry could be
applied to predict credit card customer defaults.