2. 2
PKC
Loan Risk Assessment & Scoring Model
Objective
To develop a statistical prediction model based on historical data of loan defaulters
(Bad Loans) and non-defaulters (Good Loans).
To calculate the probability of default for each client.
Data Source
Banking data of all clients who have a loan balance for the month of November 2015.
Data Limitations
Payment history, credit utilization, credit history, credit use and assets held by clients with
other banks are not available; these would be useful for the loan scoring model.
The default percentage in the available data differs from the actual default percentage, which
suggests the data is incomplete.
Approach
To predict whether borrowers are likely to default on their loans, two classes are
created: Bad Loans and Good Loans.
A statistical model for this binary variable is then built with Logistic Regression, using
all the available demographic and banking variables as attributes.
Conclusion
Probability scores can be assigned to each client to predict loan defaults.
Equation: $\text{Probability to default} = \dfrac{e^{z}}{1 + e^{z}}$, where e = 2.718 and
z = -1.56 + 0.02*(Age) + (-0.0083)*(Average Salary) + (-0.0018)*(Total Assets) +
(-0.00001)*(Total Loan) + (-4.33)*(Assets to Loan Ratio) + 0.75*(Male) + (-3.3)*(Private)
+ 2.1*(EU&CA) + 4.1*(North America) + 1.8*(Sub-Saharan Africa)
3. 3
Key Findings & Statistical Measures
The data suggests that some countries carry a higher weightage in the bank's current active
loan clearance model, yet the default rate for these countries was very high in
Nov 2015. It is possible that clients from these countries obtain loans more easily than
clients from other countries and later default without affecting their credit rating in their
home country.
Some examples of defaults by country:
United States : 50% (11 defaults out of 22 loan clients)
United Kingdom : 36% (5 defaults out of 14 loan clients)
Somalia : 43% (3 defaults out of 7 loan clients)
Romania : 33% (2 defaults out of 6 loan clients)
Maldives : 29% (4 defaults out of 14 loan clients)
Canada : 19% (3 defaults out of 16 loan clients)
Lebanon : 11% (24 defaults out of 213 loan clients)
Germany : 50% (1 default out of 2 loan clients)
However, the overall default rate in the available data is only 2.05% by number of
clients and 2.53% by loan amount.
4. 4
Key Findings & Statistical Measures
The monthly salary of each client is one of the important factors in predicting loan
defaults. A risk indicator can be generated for clients whose salary is not being
credited each month, because the loan scoring model suggests a high probability that these
customers will default.
One interesting finding is that government-sector employees are more
likely to default than private and semi-government loan clients.
Assets (Demand Deposit & Time Deposit) available in the data are one of the key
variables for predicting the likelihood of default. As the value of a client's assets
increases, the probability of default decreases significantly.
Employment Type     Total Loan Clients   Defaults   Default Percentage
Government                4291              127          2.96%
Private                  12158               78          0.64%
Semi-Government             65                0          0.00%
                    Non-Kuwait   Kuwait
Defaults               198         153
Salary Missing         198         142
Percent                100%       92.81%
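The percentages in the two tables on this slide can be recomputed from the raw counts. The sketch below is only an illustration: the dictionary and function names are made up for this example, and the counts are copied from the slide.

```python
# Counts copied from the employment-type table: (total loan clients, defaults).
sector_counts = {
    "Government": (4291, 127),
    "Private": (12158, 78),
    "Semi-Government": (65, 0),
}


def percentage(part: int, whole: int) -> float:
    """Return part/whole as a percentage; 0.0 when the group is empty."""
    return 100.0 * part / whole if whole else 0.0


# Default rate per employment type.
rates = {name: percentage(defaults, clients)
         for name, (clients, defaults) in sector_counts.items()}

for name, rate in rates.items():
    print(f"{name:<16} {rate:5.2f}%")

# Share of Kuwaiti defaulters with a missing salary record (142 of 153).
kuwait_missing_share = percentage(142, 153)
print(f"Kuwait salary missing: {kuwait_missing_share:.2f}%")
```

Keeping the counts next to the percentages makes it easy to re-check the table whenever the underlying extract changes.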
5. 5
Conclusion & Future Prospects
The Loan Score model based on the available data demonstrates statistical scope to
predict loan defaults and provides a significant risk assessment measure to differentiate
between good and bad loans.
Enriching the data with payment history, credit utilization, credit history, credit use
and assets held by clients with other banks will further improve the loan scoring model.
We can also create a loan scoring model for prospective customers and target
customers who have a very low probability of default, which reduces the risk of
default and maximizes the bank's profit.
With all the demographic, credit and asset data available, we can create
individual models based on geography, amount of loan and number of clients.
Example:
i. Home Client Loan Scoring Model (clients of Kuwait)
ii. Indian Client Loan Scoring Model (Indian loan clients are the highest in terms of count)
iii. Out-of-Home Client Loan Scoring Model (for other countries & geographies)
7. 7
Conceptualization: Modeling steps of Logistic Regression
Data Access & Manage
Target Variable Creation
Variable Transformation
Dummy Variable Creation
Data Partition in Training & Validation
Model Calibration
Lift Chart Comparison
Model Creation on entire data
Decide Probability Cut-off
Model Validation
Validation
8. 8
Conceptualization: Modeling steps of Logistic Regression
A. Data Access and Management in SAS
Import and merge (join) the available Banking Loan data (November 2015), Client Level data,
Client Salary data and Employment data.
B. Variable Creation (Target)
Creation of the target variable: clients for whom a Legal Loan is available are considered Bad
Loans and assigned the value '1'; clients for whom a Legal Loan is not available are considered
Good Loans and assigned the value '0'.
C. Variable Transformation to group the variables Country Name and Profession
into fewer categories.
Country Names into Country Categories by region: East Asia & Pacific, Europe & Central
Asia, Latin America & Caribbean, Middle East & North Africa, North America, South Asia,
Sub-Saharan Africa and Others
Profession into Professional Categories:
Blue-Collar (e.g. Clerk, Technician, Driver & others)
High-risk (Policeman, Army personnel & others)
White-Collar (Doctor, Economist, Auditor & others)
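Steps B and C can be sketched in a few lines of Python (the slides describe the work as done in SAS, so this is an illustration rather than the original code). The field names `has_legal_loan`, `country` and `profession` and the two lookup tables are hypothetical; only the category names come from the slide, and the lookup tables cover just a handful of values.

```python
# Hypothetical, deliberately incomplete lookup tables for illustration.
REGION_BY_COUNTRY = {
    "United States": "North America",
    "Canada": "North America",
    "United Kingdom": "Europe & Central Asia",
    "Germany": "Europe & Central Asia",
    "Romania": "Europe & Central Asia",
    "India": "South Asia",
    "Kuwait": "Middle East & North Africa",
}
CATEGORY_BY_PROFESSION = {
    "Clerk": "Blue-Collar",
    "Technician": "Blue-Collar",
    "Driver": "Blue-Collar",
    "Policeman": "High-risk",
    "Doctor": "White-Collar",
    "Economist": "White-Collar",
    "Auditor": "White-Collar",
}


def make_target(has_legal_loan: bool) -> int:
    """Bad Loan (Legal Loan available) -> 1, Good Loan -> 0."""
    return 1 if has_legal_loan else 0


def transform(client: dict) -> dict:
    """Add the target and the grouped country/profession categories."""
    return {
        "target": make_target(client["has_legal_loan"]),
        # Countries missing from the lookup fall into the 'Others' group.
        "region": REGION_BY_COUNTRY.get(client["country"], "Others"),
        # Unmapped professions are left as None for manual review.
        "profession_category": CATEGORY_BY_PROFESSION.get(client["profession"]),
    }


example = transform(
    {"has_legal_loan": True, "country": "Canada", "profession": "Driver"}
)
print(example)
```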
9. 9
Conceptualization: Modeling steps of Logistic Regression
D. Conversion of Categorical Character variables into Numeric Dummy Variables
for Gender, Professional Category, Employment Type & Country Classification
Example:
Employment Type     Government   Private   Semi-Government
Government              1           0             0
Private                 0           1             0
Semi-Government         0           0             1
Missing                 0           0             0
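The dummy-variable table above can be reproduced with a small helper. This is a minimal sketch; the function name `dummy_encode` is an assumption made for this example, and a missing value is represented here by `None`.

```python
EMPLOYMENT_LEVELS = ("Government", "Private", "Semi-Government")


def dummy_encode(value, levels=EMPLOYMENT_LEVELS):
    """Return one 0/1 indicator per level.

    A missing or unknown value yields all zeros, matching the
    'Missing' row of the table.
    """
    return {level: int(value == level) for level in levels}


for value in ("Government", "Private", "Semi-Government", None):
    print(value, dummy_encode(value))
```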
E. Data partition: The entire available client data of 17,096 observations is divided into two parts.
Training Data-Set (10,258): 60% of all the observations.
Validation Data-Set (6,838): 40% of all the observations.
A uniform random distribution has been used to randomly assign
observations to each Data-Set.
[Figure: Full Data split into Training Data (60% of population) and Validation Data (40% of population)]
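One way to obtain a random 60/40 partition is to shuffle the client ids uniformly at random and cut the list at the training share. The slides do not say exactly how the random numbers were applied, so the shuffle-and-cut approach, the function name `partition` and the seed are assumptions of this sketch.

```python
import random


def partition(client_ids, train_share=0.60, seed=2015):
    """Shuffle the ids uniformly at random, then cut at the training share."""
    ids = list(client_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    cut = round(train_share * len(ids))
    return ids[:cut], ids[cut:]


# 17,096 placeholder ids standing in for the real client keys.
training, validation = partition(range(17096))
print(len(training), len(validation))
```

With 17,096 ids this cut gives the same set sizes as the slide (10,258 and 6,838). Drawing an independent uniform number per client and comparing it with 0.60 would also work, but the set sizes would then only be approximately 60/40.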
10. 10
Conceptualization: Modeling steps of Logistic Regression
F. Model Calibration:
i. Logistic Regression is run on the Training Data-Set to generate the estimates for each
independent variable along with the intercept.
ii. The binary logistic model is used to estimate the probability of a binary
response (1s as Bad Loans & 0s as Good Loans) based on predictor (or
independent) variables as attributes or features.
iii. Multicollinearity check among independent variables through VIF (Variance
Inflation Factor). Variables with a VIF higher than 5 are removed from the model.
$\mathrm{VIF}_i = \dfrac{1}{1 - R_i^{2}}$
where $R_i^{2}$ is the coefficient of determination of the regression of the i-th independent
variable on all the other independent variables.
iv. Highly collinear continuous or dummy variables such as Female, Government,
Semi-Government, South Asia, Middle East & North Africa, East Asia & Pacific,
Loan Available (Salary*15 or 15000 – Existing Loan) and all Age-Groups are
removed from the model to eliminate multicollinearity.
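To illustrate the VIF rule, the sketch below covers only the simplest case of two predictors, where the R-squared of regressing one on the other equals their squared Pearson correlation. The general case needs a full regression of each variable on all the others, which is not shown. The function names and the toy data are made up for this example.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x, y):
    """VIF = 1 / (1 - R^2); with two predictors R^2 is the squared correlation."""
    r_squared = pearson_r(x, y) ** 2
    return math.inf if r_squared >= 1.0 else 1.0 / (1.0 - r_squared)


# Toy data: Male and Female dummies are perfectly collinear.
male = [1, 0, 1, 1, 0, 0]
female = [0, 1, 0, 0, 1, 1]
# Toy data: age and salary with only weak correlation.
age = [25, 30, 35, 40, 45, 50]
salary = [900, 400, 1200, 700, 1500, 600]

vif_gender = vif_two_predictors(male, female)
vif_age_salary = vif_two_predictors(age, salary)
print(vif_gender > 5, vif_age_salary > 5)
```

This shows why only one dummy of a complementary pair such as Male/Female can remain in the model: the pair is far above the cut-off of 5, while the weakly related pair stays close to 1.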
11. 11
Conceptualization: Modeling steps of Logistic Regression
F. Model Calibration (Continued):
v. The model is created with the stepwise method, using a significance level of 0.05
(95% confidence level) for a variable to enter and stay in the model. Below are the
estimates of all the significant variables along with the intercept.
Logit(Target) = b0 + b1*X1 + b2*X2+…+bn*Xn,
Where Logit(Target) = log[Prob(Target=1| X1, X2, …, Xn) /
Prob(Target=0| X1, X2, …, Xn)]
And b0, b1,…, bn are the Estimates/Betas
Parameters                 Estimate    Pr > ChiSq
Intercept                  -1.46170      0.0004
Age                         0.01870      0.0271
Average Salary             -0.00829      <.0001
Total Loan                 -0.00001      0.0038
Assets to Loan Ratio       -7.46580      <.0001
Male                        0.72240      0.0013
Private                    -3.41490      <.0001
Europe and Central Asia     1.90480      0.0113
North America               3.57640      <.0001
$\text{Probability to default}(1) = \dfrac{e^{z}}{1 + e^{z}}$
where e = 2.71828 and z = b0 + b1*X1 + b2*X2 + … + bn*Xn
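The probability formula on this slide follows directly from the logit definition. As a short worked derivation, write $p$ for Prob(Target=1 | X1, …, Xn), a symbol introduced here for brevity, and solve the logit equation for $p$:

```latex
\begin{align*}
\log\frac{p}{1-p} &= z, \qquad z = b_0 + b_1 X_1 + \dots + b_n X_n \\
\frac{p}{1-p} &= e^{z} \\
p &= e^{z}\,(1 - p) \\
p\,(1 + e^{z}) &= e^{z} \\
p &= \frac{e^{z}}{1 + e^{z}}
\end{align*}
```

A positive estimate therefore raises the odds of default as its variable increases, and a negative estimate lowers them.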
12. 12
Conceptualization: Modeling steps of Logistic Regression
F. Model Calibration (Continued):
vi. The parameter estimates are used to calculate the probability of default for each loan
client in the Training Data-set (60%).
$\text{Probability to default}(1) = \dfrac{e^{z}}{1 + e^{z}}$
where z = -1.46 + 0.018*(Age) + (-0.0083)*(Average Salary) + (-0.00001)*(Total Loan) +
(-7.47)*(Assets to Loan Ratio) + 0.72*(Male) + (-3.4)*(Private) + 1.9*(EU&CA) + 3.6*(North America)
vii. The same parameter estimates are also used to calculate the probability of default for
each loan client in the Validation Data-set (40%).
Below are the reference files of Training and Validation data sets with all calculations.
13. 13
Conceptualization: Modeling steps of Logistic Regression
G. Model Validation: Comparing Training & Validation with Lift Chart:
[Figure: Lift Chart. Series: Cumulative Percent (Training), Cumulative Percent (Validation), Cumulative (Without Model); horizontal axis 1 to 11, vertical axes in percent]
The Lift Chart suggests that the Training and Validation data sets behave
consistently under the same parameter estimates, which supports the
validity of the model.
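The cumulative curves in a lift chart can be computed by ranking clients from the highest to the lowest predicted probability and counting the share of defaulters captured in each successive group. The sketch below assumes this standard construction; the slides do not show how the chart was built, and the function name and toy data are invented for illustration.

```python
def cumulative_gains(scores, targets, groups=10):
    """Cumulative percent of defaulters captured per score-ranked group."""
    if not any(targets):
        raise ValueError("at least one defaulter (target = 1) is required")
    # Sort targets by descending predicted probability.
    ranked = [t for _, t in sorted(zip(scores, targets),
                                   key=lambda pair: pair[0], reverse=True)]
    total_bad = sum(ranked)
    n = len(ranked)
    return [100.0 * sum(ranked[: round(g * n / groups)]) / total_bad
            for g in range(1, groups + 1)]


# Toy data: ten clients, three of them defaulters.
scores = [0.91, 0.85, 0.60, 0.55, 0.30, 0.22, 0.15, 0.10, 0.05, 0.02]
targets = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

with_model = cumulative_gains(scores, targets, groups=5)
# Without a model, each group captures defaulters in proportion to its size.
without_model = [100.0 * g / 5 for g in range(1, 6)]
print(with_model)
print(without_model)
```

Running the same function on the Training and Validation sets and plotting both curves against the no-model diagonal gives the comparison described on this slide.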
Association of Predicted Probabilities and Observed Responses
Percent Concordant       95.6     Somers' D    0.930
Percent Discordant        2.6     Gamma        0.946
Percent Tied              1.7     Tau-a        0.038
Pairs                 2166770     c            0.965
Percent Concordant indicates a high
share of correctly ordered
predictions (95.6%). Somers' D
and Gamma suggest the model
has strong predictive
power (0.930 and 0.946). The area
under the curve, c, is 0.965,
which is close to 1.0.
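Somers' D, Gamma and c are all functions of the concordant, discordant and tied pair percentages, so they can be cross-checked from the table. The sketch below uses the standard definitions of these measures; the function name is an assumption, and because the inputs are the rounded percentages from the slide, the results only approximately match the reported values.

```python
def association_stats(pct_concordant, pct_discordant, pct_tied):
    """Rank-association measures from pair percentages (0-100 scale)."""
    conc = pct_concordant / 100.0
    disc = pct_discordant / 100.0
    tied = pct_tied / 100.0
    return {
        # Somers' D: concordant minus discordant share of all pairs.
        "somers_d": conc - disc,
        # Gamma: same difference, but ignoring tied pairs.
        "gamma": (conc - disc) / (conc + disc),
        # c statistic (area under the ROC curve): ties count as half.
        "c": conc + 0.5 * tied,
    }


stats = association_stats(95.6, 2.6, 1.7)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```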
14. 14
Conceptualization: Modeling steps of Logistic Regression
H. Model Creation on the entire data-set (17,096): The parameter estimates are now
re-estimated on the entire data-set by following all the previous steps. Below are the
estimates of all the significant variables along with the intercept.
Z = -1.56 + 0.02*(Age) + (-0.0083)*(Average Salary) + (-0.0018)*(Total Assets)
+ (-0.00001)*(Total Loan) + (-4.33)*(Assets to Loan Ratio) + 0.75*(Male) + (-3.3)*(Private)
+ 2.1*(EU&CA) + 4.1*(North America) + 1.8*(Sub-Saharan Africa)
Parameters                 Estimate    Pr > ChiSq
Intercept                  -1.5571       <.0001
Age                         0.0201       0.0028
Average Salary             -0.00831      <.0001
Total Assets               -0.0018       0.0009
Total Loan                 -0.00001      0.0001
Assets to Loan Ratio       -4.3266       0.0155
Male                        0.7518       <.0001
Private                    -3.2798       <.0001
Europe and Central Asia     2.0924       0.0009
North America               4.0713       <.0001
Sub-Saharan Africa          1.7926       0.0114
$\text{Probability to default}(1) = \dfrac{e^{z}}{1 + e^{z}}$
where e = 2.71828 and z = b0 + b1*X1 + b2*X2 + … + bn*Xn
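Scoring a client with the full-data model amounts to evaluating z from the parameter table on this slide and passing it through the logistic function. In the sketch below the coefficients are copied from that table, while the dictionary keys, the function name and the example client are hypothetical. The units of the monetary variables are not stated in the slides, so the example values are placeholders only.

```python
import math

# Estimates copied from the full-data parameter table.
COEFFICIENTS = {
    "age": 0.0201,
    "average_salary": -0.00831,
    "total_assets": -0.0018,
    "total_loan": -0.00001,
    "assets_to_loan_ratio": -4.3266,
    "male": 0.7518,
    "private": -3.2798,
    "europe_central_asia": 2.0924,
    "north_america": 4.0713,
    "sub_saharan_africa": 1.7926,
}
INTERCEPT = -1.5571


def default_probability(client: dict) -> float:
    """Return e^z / (1 + e^z); variables absent from `client` count as 0."""
    z = INTERCEPT + sum(beta * client.get(name, 0.0)
                        for name, beta in COEFFICIENTS.items())
    # Two algebraically equal forms, chosen to avoid overflow in exp().
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical client: placeholder values, units as in the source data.
government_client = {"age": 40, "average_salary": 300, "total_loan": 10000,
                     "total_assets": 50, "assets_to_loan_ratio": 0.005, "male": 1}
private_client = dict(government_client, private=1)

p_government = default_probability(government_client)
p_private = default_probability(private_client)
print(p_government, p_private)
```

Because the Private estimate is strongly negative, the otherwise identical private-sector client receives the lower probability, in line with the sector finding reported earlier in the deck.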
15. 15
Conceptualization: Modeling steps of Logistic Regression
I. Probability cut-off to accept or reject a loan application: Decide the probability
level above which loans are rejected and at or below which they are accepted.
Cut-off P > 0.40
Actual \ Predicted       0        1     Total    Percent Correct
0                    16594      151     16745        99.1%
1                      184      167       351        47.6%
Total                16778      318     17096
Cut-off P > 0.30
Actual \ Predicted       0        1     Total    Percent Correct
0                    16458      287     16745        98.3%
1                       96      255       351        72.6%
Total                16554      542     17096
Cut-off P > 0.25
Actual \ Predicted       0        1     Total    Percent Correct
0                    16412      333     16745        98.0%
1                       84      267       351        76.1%
Total                16496      600     17096
These tables provide the frequency
distribution of correct and
incorrect predictions at cut-offs of 0.40, 0.30 and
0.25.
The best probability cut-off needs to
be chosen to minimize risk
and maximize profit.
If the target is to acquire more
customers, then the 0.30 cut-off is
appropriate; otherwise the 0.25
cut-off is good enough to reduce
risk.
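The trade-off between the three cut-offs is easier to compare once each table is reduced to a few rates. The sketch below recomputes them from the counts in the tables; the names `CONFUSION` and `summarize` are assumptions for this example, and the terms sensitivity and specificity are standard labels that the slides themselves do not use.

```python
# (true negatives, false positives, false negatives, true positives)
# copied from the three cut-off tables on this slide.
CONFUSION = {
    0.40: (16594, 151, 184, 167),
    0.30: (16458, 287, 96, 255),
    0.25: (16412, 333, 84, 267),
}


def summarize(tn, fp, fn, tp):
    """Reduce one confusion matrix to the rates used to pick a cut-off."""
    return {
        "specificity": tn / (tn + fp),   # good loans correctly accepted
        "sensitivity": tp / (tp + fn),   # bad loans correctly rejected
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
        "rejected": fp + tp,             # applications rejected in total
    }


summary = {cutoff: summarize(*counts) for cutoff, counts in CONFUSION.items()}
for cutoff, row in summary.items():
    print(f"P > {cutoff:.2f}: "
          f"specificity {row['specificity']:.1%}, "
          f"sensitivity {row['sensitivity']:.1%}, "
          f"rejected {row['rejected']}")
```

Lowering the cut-off catches more of the 351 actual defaulters but also rejects more good clients, which is the risk versus acquisition trade-off described above.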
16. 16
Conceptualization: Modeling steps of Logistic Regression
J. Model Validation: ROC Curve
The accuracy of the model is measured
by the area under the ROC curve. An
area of 1 represents a perfect test.
In our model the area under the ROC
curve at the last step is 0.96, which
would be considered "very good"
at separating good loans from
bad loans.
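The area under the ROC curve equals the probability that a randomly chosen bad loan receives a higher score than a randomly chosen good loan, with ties counted as one half. The sketch below computes it by direct pair comparison on toy data; the function name and the data are invented for illustration and are not taken from the bank's data.

```python
def roc_auc(scores, targets):
    """Area under the ROC curve via pairwise comparison of scores."""
    bad = [s for s, t in zip(scores, targets) if t == 1]
    good = [s for s, t in zip(scores, targets) if t == 0]
    if not bad or not good:
        raise ValueError("both classes must be present")
    # A pair is concordant when the bad loan scores higher; ties count half.
    wins = sum(1.0 if b > g else 0.5 if b == g else 0.0
               for b in bad for g in good)
    return wins / (len(bad) * len(good))


# Toy data: six clients, three defaulters.
scores = [0.90, 0.80, 0.35, 0.30, 0.20, 0.10]
targets = [1, 1, 0, 1, 0, 0]

auc = roc_auc(scores, targets)
print(round(auc, 3))
```

The pairwise loop is quadratic in the number of clients, which is fine for a sketch; for the full data set a rank-based calculation would be the more practical choice.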