Construction of a robust prediction model to forecast the likelihood of a credit card holder to experience payment defaults in upcoming months.

Xi global resources
Group of company
March 28, 2024

Contents
01 Introduction
02 Resources
03 Methodology
04 Result and discussion
05 Recommendations
06 Conclusion

Introduction
The client is a prominent financial institution in Taiwan, known for providing credit
card services to a substantial clientele. Like other credit card issuers, the company
faces the task of forecasting the likelihood of a customer's payment default in the
forthcoming month.
According to Augustin, the company's project manager, accurate forecasting of
defaults is essential for managing risk effectively and making informed decisions within
the organization.
After several meetings with the company's board of directors and management, it was
observed that a professional data scientist was needed who could create a robust,
reliable, efficient and unbiased model to solve the identified problem. In this regard, my
services were sought as an analyst. The purpose of my service is to construct a robust
prediction model that can accurately forecast the likelihood of a credit card holder
experiencing payment default during the upcoming month.
The organization has provided a dataset covering the demographic information,
repayment history, bill statements, and other pertinent attributes of credit card
customers. These characteristics form the foundation for constructing the predictive
model.
Source: https://images.app.goo.gl/5w6oRm8eHyFVsNmo8
Research findings indicate that missed payments can result in substantial financial
losses for a company, while also affecting customer relationships and credit risk.
Odusanya, Hafeez. (2023). "Influence of Credit Risk Management on Financial
Performance of Commercial Banks in Nigeria."

Resources
Dataset Information
The only resource provided is the dataset. This
dataset contains information on default payments,
demographic factors, credit data, payment history,
and bill statements of credit card clients
in Taiwan from April 2005 to September 2005.
The dataset contains 25 variables such as:
• ID of the client
• Amount of given credit in NT dollars
• Gender
• Education
• Marital status
• Age
• Payment status from April to September
• Amount of bill statement from April to September
• Amount of previous payment from April to September
• Default payment next month
Questions addressed
• Which factors have the greatest impact on the probability of credit card holders defaulting
on their payments?
• Do demographic characteristics, including age, gender, education, and marital status, exhibit
a correlation with default payment behavior?
• What is the impact of repayment patterns, namely from PAY_0 to PAY_6, on the probability
of default payment?
• Is it possible to forecast future default payment behavior based on prior bill amounts
(BILL_AMT1 to BILL_AMT6) and previous payment amounts (PAY_AMT1 to PAY_AMT6)?
• Do variations in default payment behavior exist among individuals with varying levels of
education or marital statuses?
• What is the relationship between the credit amount provided (LIMIT_BAL) and the rates of
default payments?
• Given historical data and demographic information, can a predictive model effectively
estimate default payment with a high degree of accuracy?
• Does the dataset exhibit any temporal patterns in default payment behavior during the
six-month duration?
• Are there specific cohorts of credit card customers who demonstrate a greater inclination
towards default payment, and if yes, what are the defining characteristics of these cohorts?
• What is the relationship between various combinations of repayment status and bill/payment
amounts and the results of default payments?

Methods
To tackle this problem, we use machine learning methods to build a predictive model from the dataset furnished by the
organization. The dataset was first transformed to decode the categorical variables, which by default were stored either as
dummy variables or as dichotomous codes. A preliminary exploratory analysis was then conducted to understand the data
structure and to detect missing values and outliers. Because the dataset contains both qualitative and quantitative variables,
bar charts were used for the qualitative variables, while boxplots and density plots were used for the quantitative
(continuous) variables.
The relationships between the quantitative variables were also investigated with a correlation matrix. The dataset was then
split into a training set and a test set in a 75:25 ratio. Splitting ensures that one portion of the data is used only to train the
machine learning model, while the other portion is held out during training and used only to assess model performance.
Holding out a test set helps detect overfitting and allows a more accurate evaluation of the model's ability to generalize to
previously unseen data.
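A minimal sketch of a 75:25 split, using only the Python standard library. The function name `train_test_split`, the fixed seed, and the toy `records` list are assumptions made for illustration; the tooling actually used for the analysis is not stated in the slides.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle rows and split them into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)                    # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)    # seeded for a reproducible split
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy stand-in for the client records (hypothetical IDs, not the real data).
records = list(range(1, 101))
train, test = train_test_split(records)
print(len(train), len(test))
```

Shuffling before splitting avoids any ordering in the file leaking into one of the two sets, and no record appears in both.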
At first, all the variables in the dataset were used to train the model. A backward selection process was then applied, with a
stopping condition, to remove any variable whose estimate was not statistically significant. The significance level for the
selection was set at alpha = 0.05. The selection step refines the model by iterating to identify and eliminate the variables that
do not contribute to it.
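The backward selection loop can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually run: `backward_eliminate` and `fake_fit` are hypothetical names, the feature names `x1` to `x4` and their p-values are made up, and the exact stopping rule used in the analysis is not specified in the slides.

```python
def backward_eliminate(features, fit_pvalues, alpha=0.05):
    """Drop the least significant feature until all p-values are <= alpha.

    fit_pvalues(features) must refit the model on `features` and return a
    dict mapping each feature name to its p-value.
    """
    selected = list(features)
    while selected:
        pvalues = fit_pvalues(selected)
        worst = max(selected, key=lambda name: pvalues[name])
        if pvalues[worst] <= alpha:      # stop condition: everything is significant
            break
        selected.remove(worst)           # remove one variable, then refit
    return selected


# Hypothetical stand-in for a real model fit: fixed, made-up p-values.
def fake_fit(features):
    made_up = {"x1": 0.001, "x2": 0.03, "x3": 0.40, "x4": 0.20}
    return {name: made_up[name] for name in features}


kept = backward_eliminate(["x1", "x2", "x3", "x4"], fake_fit)
print(kept)
```

Removing one variable per iteration and refitting matters because p-values of the remaining variables can change once a correlated variable is dropped.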
The logistic regression model used is given by
$$P(Y = 1 \mid X) = \frac{e^{k}}{1 + e^{k}}, \quad \text{where } k = \alpha_0 + \alpha_1 X_1 + \alpha_2 X_2 + \dots$$

Structure of the dataset
Figure 1: Variables contained in the dataset, showing the total number of observations by
variable type (integer or numeric)

Figure 2: Variables contained in the dataset, showing the percentage of values present (and
any missing values) out of the total number of observations.

Preliminary Analysis: Exploratory data analysis
Figure 3: Distribution of gender by default payment
next month

Figure 4: Distribution of marriage by default payment next month

Figure 5: Distribution of education by default payment next month

Figure 6: Distribution of repayment status in September, 2005 by
default payment next month

Figure 7: Distribution of repayment status in August, 2005 by default
payment next month

Figure 8: Distribution of repayment status in July, 2005 by default
payment next month

Figure 9: Distribution of repayment status in June, 2005 by default
payment next month

Figure 10: Distribution of repayment status in May, 2005 by default
payment next month

Figure 11: Distribution of repayment status in April, 2005 by default
payment next month

Estimators Yes No
Mean 48509.16 51994.23
Std.Dev 73782.07 73577.61
Min -6676 -165580
Q1 2986.5 3676.5
Median 20185 23119.5
Q3 59667 69031
Max 613860 964511
MAD 29304.33 33292.52
IQR 56638.75 65349.75
CV 1.52 1.42
Skewness 2.97 2.58
Kurtosis 11.62 9.31
Figure 12: Distribution of amount of bill statement in
September, 2005 by default payment next month
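The estimators in these tables can be computed along the following lines. This is a sketch under assumptions: `describe` is a hypothetical helper, the quantile convention may differ from the tool that produced the tables (so Q1, Q3 and IQR need not match exactly), and the 1.4826 scaling of the MAD is assumed because it is the default of R's `mad()` and is consistent with table values such as 1482.6 for a median of 1000. Skewness and kurtosis are left out.

```python
import statistics


def describe(values):
    """Return a subset of the summary estimators for one group of values."""
    data = sorted(values)
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)                      # sample standard deviation
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    # Median absolute deviation, scaled by 1.4826 (assumed convention).
    mad = 1.4826 * statistics.median(abs(x - median) for x in data)
    return {
        "Mean": mean, "Std.Dev": sd, "Min": data[0], "Q1": q1,
        "Median": median, "Q3": q3, "Max": data[-1], "MAD": mad,
        "IQR": q3 - q1, "CV": sd / mean,             # CV = Std.Dev / Mean
    }


# Toy input, not taken from the dataset.
summary = describe([1, 2, 3, 4, 5])

# Check the CV definition against the September bill statement table
# (Yes group): Std.Dev 73782.07 divided by Mean 48509.16.
cv_yes = 73782.07 / 48509.16
print(round(cv_yes, 2))   # 1.52
```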

Estimators Yes No
Mean 47283.62 49717.44
Std.Dev 71651.03 71029.95
Min -17710 -69777
Q1 2693 3054
Median 20300.5 21660.5
Q3 57920.5 65698
Max 581775 983931
MAD 29519.31 31535.64
IQR 55225.75 62631
CV 1.52 1.43
Skewness 2.97 2.63
Kurtosis 11.54 9.95
Figure 13: Distribution of amount of bill statement in
August, 2005 by default payment next month

Estimators Yes No
Mean 45181.6 47533.37
Std.Dev 68516.98 69576.66
Min -61506 -157264
Q1 2500 2768.5
Median 19834.5 20202.5
Q3 54734.5 61896
Max 578971 1664089
MAD 28828.42 29416.27
IQR 52233.75 59124.25
CV 1.52 1.46
Skewness 2.95 3.13
Kurtosis 11.34 22.03
Figure 14: Distribution of amount of bill statement in
July, 2005 by default payment next month

Estimators Yes No
Mean 42036.95 43611.17
Std.Dev 64351.08 64324.8
Min -65167 -170000
Q1 2141 2360
Median 19119.5 19000
Q3 50178.5 55993
Max 548020 891586
MAD 27679.4 27591.19
IQR 48034.25 53628
CV 1.53 1.47
Skewness 3 2.77
Figure 15: Distribution of amount of bill statement in
June, 2005 by default payment next month

Estimators Yes No
Mean 39540.19 40530.45
Std.Dev 61424.7 60617.27
Min -53007 -81334
Q1 1500.5 1823
Median 18478.5 17998
Q3 47856 51136.5
Max 547880 927171
MAD 26726.83 26092.28
IQR 46350.25 49312.25
CV 1.55 1.5
Skewness 3.03 2.83
Figure 16: Distribution of amount of bill statement in
May, 2005 by default payment next month

Estimators Yes No
Mean 38271.44 39042.27
Std.Dev 59579.67 59547.02
Min -339603 -209051
Q1 1150 1265
Median 18028.5 16679
Q3 47430 49844
Max 514975 961664
MAD 26150.84 24310.19
IQR 46274 48577
CV 1.56 1.53
Skewness 2.9 2.83
Figure 17: Distribution of amount of bill statement in
April, 2005 by default payment next month

Estimators Yes No
Mean 3397.04 6307.34
Std.Dev 9544.25 18014.51
Min 0 0
Q1 0 1163.5
Median 1636 2459.5
Q3 3478.5 5606.5
Max 300000 873552
MAD 2425.53 3068.24
IQR 3478.25 4442.5
CV 2.81 2.86
Skewness 14.77 13.94
Kurtosis 323.48 371.72
Figure 18: Distribution of amount of previous
payments in September, 2005 by default payment next
month

Estimators Yes No
Mean 3388.65 6640.47
Std.Dev 11737.99 25302.26
Min 0 0
Q1 0 1005
Median 1533.5 2247.5
Q3 3310.5 5311.5
Max 358689 684259
MAD 2273.57 2863.64
IQR 3309.75 4306.25
CV 3.46 3.81
Kurtosis 439.72 1439.86
Figure 19: Distribution of amount of previous payment
in August, 2005 by default payment next month

Estimators Yes No
Mean 3367.35 5753.5
Std.Dev 12959.62 18684.26
Min 0 0
Q1 0 600
Median 1222 2000
Q3 3000 5000
Max 508229 896040
MAD 1811.74 2799.89
IQR 3000 4400
CV 3.85 3.25
Kurtosis 492.11 537.52
Figure 20: Distribution of amount of previous payment
in July, 2005 by default payment next month

Estimators Yes No
Mean 3155.63 5300.53
Std.Dev 11191.97 16689.78
Min 0 0
Q1 0 390
Median 1000 1734
Q3 2940.5 4602
Max 432130 621000
MAD 1482.6 2570.83
IQR 2939.25 4212
CV 3.55 3.15
Skewness 16.97 12.2
Kurtosis 463.45 248.77
Figure 21: Distribution of amount of previous payment
in June, 2005 by default payment next month

Estimators Yes No
Mean 3219.14 5248.22
Std.Dev 11944.73 16071.67
Min 0 0
Q1 0 369
Median 1000 1765
Q3 3000 4600
Max 332000 426529
MAD 1482.6 2616.79
IQR 3000 4231
CV 3.71 3.06
Kurtosis 316.99 160.71
Figure 22: Distribution of amount of previous payment
in May, 2005 by default payment next month

Estimators Yes No
Mean 3441.48 5719.37
Std.Dev 13464.01 18792.95
Min 0 0
Q1 0 300
Median 1000 1706
Q3 2975 4545
Max 345293 528666
MAD 1482.6 2529.32
IQR 2974.5 4245
CV 3.91 3.29
Skewness 12.66 10.2
Kurtosis 208.12 155.51
Figure 23: Distribution of amount of previous payment
in April, 2005 by default payment next month

Estimators Yes No
Mean 35.73 35.42
Std.Dev 9.69 9.08
Min 21 21
Q1 28 28
Median 34 34
Q3 42 41
Max 75 79
MAD 10.38 8.9
IQR 14 13
CV 0.27 0.26
Skewness 0.66 0.75
Kurtosis 0.11
Figure 24: Distribution of age by default payment next
month

Estimators Yes No
Mean 130109.7 178099.73
Std.Dev 115378.5 131628.36
Min 10000 10000
Q1 50000 70000
Median 90000 150000
Q3 200000 250000
Max 740000 1000000
MAD 88956 133434
IQR 150000 180000
CV 0.89 0.74
Skewness 1.35 0.91
Kurtosis 1.55 0.38
Figure 25: Distribution of amount of given credit
by default payment next month

Correlation analysis
Figure 26: Relationship between quantitative variables
• Figure 26 shows the correlation coefficients between various
pairs of variables in the dataset.
• There is a strong positive correlation between the amounts
of bill statements in different months (e.g., BILL_AMT1 and
BILL_AMT2 have a correlation coefficient of 0.951).
• There are weak positive correlations between the amounts of
bill statements and the amounts of previous payments (e.g.,
PAY_AMT1 and BILL_AMT1 have a correlation coefficient
of 0.140).
• There are also weak positive correlations between the amounts
of bill statements and the repayment status (e.g., PAY_0 and
BILL_AMT1 have a correlation coefficient of 0.285).
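Each entry of such a correlation matrix is a pairwise coefficient. The sketch below assumes the Pearson coefficient was used (the slides do not name the method) and runs on made-up bill amounts for five hypothetical clients; `pearson` is an illustrative helper, not code from the analysis.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up bill amounts for five hypothetical clients in two adjacent months.
bill_month_1 = [3900, 2700, 29000, 47000, 8600]
bill_month_2 = [3100, 1700, 14000, 48000, 5700]
r = pearson(bill_month_1, bill_month_2)
print(round(r, 3))
```

Values near 1, such as the 0.951 reported for adjacent bill statements, mean the two variables carry largely overlapping information, which is worth keeping in mind when several of them enter the same regression.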

Model
Figure 27: Model summary showing each estimate
with its significance level.
• The coefficient estimates represent the change in the log-odds
of the dependent variable (default.payment.next.month)
for a one-unit increase in the predictor variable, holding all
other variables constant.
• For example, the coefficient for AGE is 0.008812. This
means that for every one-year increase in age, the log-odds
of default.payment.next.month increase by 0.008812.
• Similarly, the coefficient for PAY_0 is 0.5867. This suggests
that a one-unit increase in the PAY_0 variable (repayment
status in September 2005) results in a 0.5867 increase in the
log-odds of default.payment.next.month, holding all other
variables constant.
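Log-odds are easier to read once exponentiated into odds ratios. The short sketch below does this for the two coefficients quoted above; the variable names are illustrative only.

```python
import math

# Coefficients quoted in the text (log-odds scale).
coefficients = {"AGE": 0.008812, "PAY_0": 0.5867}

# exp(coefficient) is the multiplicative change in the odds of default
# for a one-unit increase in the predictor, other variables held constant.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")
```

Read this way, each additional unit of PAY_0 multiplies the odds of default by roughly 1.8, while each additional year of age multiplies them by about 1.009.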

Model parameters
Figure 28: Visualization of the model estimates, displaying
the level of significance of each estimate.

Model
• The null deviance (23778) is the deviance of the null
model (intercept only, no predictors) measured
against the saturated model, which fits every
observation exactly.
• The residual deviance (20943) is the deviance of the
fitted model measured against the same saturated
model.
• A lower residual deviance indicates a better fit of
the model to the data.
• The Akaike Information Criterion (AIC) balances
the fit of the model against the number of parameters
used. Among models fitted to the same data, lower
AIC values indicate a better trade-off.
• The significance codes indicate the statistical significance of
each coefficient estimate.
• '***' indicates p < 0.001, '**' indicates p < 0.01, '*'
indicates p < 0.05, '.' indicates p < 0.1, and ' ' indicates
p ≥ 0.1.
• For example, the coefficient estimates for AGE, PAY_0,
BILL_AMT1, and PAY_AMT1 are highly significant (p <
0.001), indicating that these variables are strongly associated
with default.payment.next.month.
• PAY_AMT5, PAY_AMT4, and PAY_AMT6 have p-values
slightly above 0.05, suggesting they may have a weaker
association with default.payment.next.month.
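The two deviance figures quoted above can be compared directly. The sketch below is plain arithmetic on those reported numbers; the variable names are illustrative, and the "explained share" is offered only as a rough summary, not as a statistic reported in the slides.

```python
null_deviance = 23778       # intercept-only model, as quoted above
residual_deviance = 20943   # fitted model, as quoted above

# Drop in deviance achieved by adding the predictors.
deviance_reduction = null_deviance - residual_deviance

# Share of the null deviance removed by the fitted model.
explained_share = deviance_reduction / null_deviance

print(deviance_reduction)             # 2835
print(round(explained_share, 3))      # 0.119
```

A reduction of 2835 shows that the predictors improve clearly on the intercept-only model, while an explained share of about 12% indicates that much of the variation in default behavior is still not captured.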

Model Equation
The logistic regression equation can be constructed as follows:
logit(p) = -1.412 + 0.008812 * AGE + 0.5867 * PAY_0 + 0.1042 * PAY_2 + 0.1218 * PAY_3 -
0.00000531 * BILL_AMT1 + 0.00000338 * BILL_AMT3 - 0.00001491 * PAY_AMT1 -
0.00001212 * PAY_AMT2 - 0.000003455 * PAY_AMT6 - 0.000003057 * PAY_AMT5 -
0.000003074 * PAY_AMT4
The formula for the logit function is:
$$\operatorname{logit}(p) = \ln\left(\frac{p}{1 - p}\right)$$
where p represents the probability of the event occurring.
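To turn the equation into a probability, the logit is inverted: p = 1 / (1 + exp(-logit(p))). The sketch below uses the coefficients exactly as printed in the model equation; the function name `default_probability` and every value of the example customer are invented for illustration, and treating a missing field as zero is an assumption of the sketch.

```python
import math

INTERCEPT = -1.412
COEFFICIENTS = {
    "AGE": 0.008812, "PAY_0": 0.5867, "PAY_2": 0.1042, "PAY_3": 0.1218,
    "BILL_AMT1": -0.00000531, "BILL_AMT3": 0.00000338,
    "PAY_AMT1": -0.00001491, "PAY_AMT2": -0.00001212,
    "PAY_AMT6": -0.000003455, "PAY_AMT5": -0.000003057,
    "PAY_AMT4": -0.000003074,
}


def default_probability(customer):
    """Evaluate the logit and invert it: p = 1 / (1 + exp(-logit))."""
    # Fields missing from `customer` are treated as 0 (sketch assumption).
    logit = INTERCEPT + sum(
        beta * customer.get(name, 0.0) for name, beta in COEFFICIENTS.items()
    )
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical customer; every value below is invented for illustration.
customer = {
    "AGE": 35, "PAY_0": 2, "PAY_2": 0, "PAY_3": 0,
    "BILL_AMT1": 50000, "BILL_AMT3": 45000,
    "PAY_AMT1": 2000, "PAY_AMT2": 2000,
    "PAY_AMT4": 1000, "PAY_AMT5": 1000, "PAY_AMT6": 1000,
}
p = default_probability(customer)
print(f"Estimated probability of default: {p:.3f}")
```

Because the PAY_0 coefficient is positive, raising PAY_0 for the same customer while leaving everything else unchanged always raises the estimated probability.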

Receiver Operating Characteristic curve
Figure 29: Area under the curve (AUC)
• The AUC (Area Under the Curve) of the ROC
curve quantifies the overall performance of the
model.
• With an AUC of 0.72, the model demonstrates
moderate discrimination ability.
• An AUC of 0.72 suggests that the model is
better than random chance (AUC of 0.5) but
still has room for improvement.
• It implies that a randomly chosen defaulter
receives a higher predicted risk than a randomly
chosen non-defaulter about 72% of the time.
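The AUC can be computed directly from its pairwise interpretation: the share of (positive, negative) pairs in which the positive case gets the higher score. The sketch below is a simple quadratic-time version on made-up labels and scores; `auc` is an illustrative helper and not the routine used to produce the figure.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for the positive class (default), 0 for the negative class.
    scores: predicted risk, higher meaning more likely to default.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5            # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up labels and scores, not taken from the dataset.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.3, 0.4, 0.5, 0.2, 0.7]
example_auc = auc(labels, scores)
print(round(example_auc, 3))
```

In this toy example 8 of the 9 positive/negative pairs are ordered correctly. For a dataset of this size a rank-based or library implementation would be preferable to the double loop.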

Model evaluation
Figure 30: Visualization of the quantile-quantile plot
• In Fig 30, the observed residuals are plotted against the
quantiles of a theoretical distribution (usually the
standard normal distribution).
• If the residuals are normally distributed, the points on
the QQ plot fall approximately along a straight
line.
• As seen in Fig 30, the points deviate from a straight
line, which suggests that the residuals are not normally
distributed.
• This is not a desirable characteristic.
• These departures from normality in the residuals
may indicate problems with the model assumptions or the
presence of outliers or influential data points.

Accuracy
Trained Model 80.84 %
Retrained Model 80.91 %
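Accuracy is the share of test cases whose predicted class matches the observed one. The sketch below assumes a 0.5 probability threshold, which the slides do not state, and uses made-up labels and probabilities; `accuracy` is an illustrative helper.

```python
def accuracy(y_true, probabilities, threshold=0.5):
    """Share of cases where the thresholded prediction matches the label."""
    if len(y_true) != len(probabilities) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(1 for y, y_hat in zip(y_true, predictions) if y == y_hat)
    return correct / len(y_true)


# Made-up labels and predicted probabilities for five hypothetical clients.
y_true = [0, 0, 1, 0, 1]
probabilities = [0.10, 0.35, 0.80, 0.60, 0.40]
example_accuracy = accuracy(y_true, probabilities)
print(example_accuracy)   # 0.6
```

Accuracy depends on the chosen threshold and on how common defaults are, so it is best read alongside the AUC rather than on its own.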

Recommendations
Bill Amount and Previous
Payments
The bill amounts (e.g., BILL_AMT1,
BILL_AMT3) and previous payment
amounts (e.g., PAY_AMT1,
PAY_AMT2) also play a significant
role in predicting default. Encouraging
timely bill payments and offering
flexible payment options can help
mitigate default risks.
Customer Segmentation
Utilize the insights from the logistic
regression model to segment customers
based on their risk profiles. This
segmentation can help prioritize
collections efforts, tailor marketing
campaigns, and customize financial
products to better meet the needs of
different customer segments.
Customer Assistance Program
Implement customer assistance programs or financial
counseling services to support customers
experiencing financial difficulties. Proactively
reaching out to at-risk customers and offering
them assistance can help prevent defaults and
foster customer loyalty.
Payment Status Importance
It is crucial for the business to closely
monitor customers' payment behavior,
especially when there are signs of
payment delays or defaults.
Age Factor
While the age variable (AGE)
has a relatively small coefficient,
it still contributes to the model's
predictive power. Understanding
the age distribution of customers
and how it correlates with
default rates can help tailor
marketing strategies or financial
products to different age groups.

Conclusion
• In conclusion, the logistic regression analysis and correlation
results provide valuable insights into the factors influencing
default payment next month in the dataset.
• The analysis highlights the significance of payment status, age,
bill amounts, and previous payments in predicting default. By
closely monitoring these factors and adapting strategies
accordingly, businesses can better manage default risks and
improve their financial stability.
