Presented by: Nicia Dias
PROJECT LOAN PREDICTION
ABOUT THE PROJECT
Financial loan services are leveraged by companies across many
industries, from big banks to financial institutions to government
loans. One of the primary objectives of companies with financial
loan services is to decrease payment defaults and ensure that
individuals are paying back their loans as expected. In order to do
this efficiently and systematically, many companies employ
machine learning to predict which individuals are at the highest
risk of defaulting on their loans, so that proper interventions can
be effectively deployed to the right audience.
The dataset contains 255,347 rows and 18 columns in total.
It is a binary classification problem: determine whether a borrower
will default or not.
Statistics of the Numerical Columns
Statistics of the Categorical Columns
No duplicate or null values were found.
The ‘DEFAULT’ column is the dependent (target) column,
indicating whether a person is a defaulter or not.
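A check like the one reported above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code used in the project: the function name quality_report and the sample rows are hypothetical, and treating None or an empty string as a null value is an assumption.

```python
def quality_report(rows):
    """Count exact duplicate rows and missing values per column.

    `rows` is a list of dicts, one dict per record (hypothetical layout).
    """
    seen = set()
    duplicates = 0
    nulls = {}
    for row in rows:
        # A row is a duplicate if every column matches an earlier row.
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
        # Assumption: None and "" both count as missing.
        for col, val in row.items():
            if val is None or val == "":
                nulls[col] = nulls.get(col, 0) + 1
    return {"rows": len(rows), "duplicates": duplicates, "nulls": nulls}


# Made-up sample records, for illustration only.
sample = [
    {"Age": 25, "Income": 40000, "Default": 1},
    {"Age": 52, "Income": 90000, "Default": 0},
    {"Age": 25, "Income": 40000, "Default": 1},  # exact duplicate
    {"Age": 33, "Income": None, "Default": 0},   # missing income
]
report = quality_report(sample)
print(report)
```

For the project dataset, the statement above corresponds to a report with zero duplicates and an empty nulls mapping.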
Analysis of the Default Column
In the target variable, the classes are imbalanced: 11.62% of
customers defaulted on their loan and 88.38% did not.
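The class shares are plain proportions of the target column. The sketch below shows the calculation on a made-up list of labels. The name class_shares and the toy labels are hypothetical, and coding 1 as default and 0 as non-default is an assumption.

```python
from collections import Counter


def class_shares(labels):
    """Return the percentage share of each class, rounded to 2 decimals."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: round(100 * n / total, 2) for cls, n in counts.items()}


# Toy labels: 25 borrowers, 3 of whom defaulted (illustrative only).
labels = [0] * 22 + [1] * 3
shares = class_shares(labels)
print(shares)
```

Applied to the real target column, the same calculation would give the 11.62% and 88.38% figures quoted above.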
Analysis of the Numerical Value Columns
Age Column:
Younger individuals (in the 20-29 and 30-39 age categories) have a higher
frequency of defaults than older individuals. The frequency of defaults
decreases as age increases, with the lowest default rates observed among those
aged 60-69. This suggests that age could be a significant factor in predicting loan
default risk, with younger age groups potentially representing a higher risk profile
for lenders.
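The age comparison rests on grouping borrowers into bands and computing a default rate per band. A minimal sketch follows, using the age bands named on the slide. The function name, the (age, defaulted) record layout, and the toy records are all hypothetical.

```python
def default_rate_by_band(rows, bands):
    """Default rate per band; `rows` holds (age, defaulted) pairs.

    `bands` holds (label, low, high) with inclusive bounds. Ages that
    fall outside every band are ignored.
    """
    stats = {label: [0, 0] for label, _, _ in bands}  # [defaults, total]
    for age, defaulted in rows:
        for label, low, high in bands:
            if low <= age <= high:
                stats[label][0] += defaulted
                stats[label][1] += 1
                break
    # None marks a band with no borrowers, where no rate can be computed.
    return {label: (d / n if n else None) for label, (d, n) in stats.items()}


bands = [("20-29", 20, 29), ("30-39", 30, 39), ("40-49", 40, 49),
         ("50-59", 50, 59), ("60-69", 60, 69)]

# Made-up records: (age, 1 if defaulted else 0).
toy_rows = [(22, 1), (27, 0), (35, 1), (38, 0), (36, 0),
            (45, 0), (48, 1), (41, 0), (44, 0), (63, 0), (66, 0)]
rates = default_rate_by_band(toy_rows, bands)
print(rates)
```

The same grouping applies to the other numerical columns (income, credit score, loan amount) once suitable bands are chosen.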
Income Column:
The higher the income, the larger the share of customers
who are not in default. The number of defaults also
decreases slightly as income rises.
Loan Term Column:
The balance between defaulted and non-defaulted borrowers
remains about the same across loan terms from 1 year to 5 years.
Credit Score Column:
The majority of the entries fall into the credit
score range of 300-579. The default rate is
highest in the lowest credit score range (300-579)
and decreases as the credit score range
increases (up to 740-799 and 800-850).
Months Employed Column:
The longer customers have been employed, the more the
default rate gradually decreases and the non-default
rate increases.
No. of Credit Lines Column:
As the number of credit lines increases, the
default rate tends to rise as well. Borrowers
with fewer credit lines (1 or 2) have relatively
lower default rates than those with
more credit lines (3 or 4).
Loan Amount Column:
As the loan amount increases, the default rate tends to rise as well. The default rate for loans
less than 30k is relatively low and gradually increases as the loan amount categories increase.
Notably, the default rate spikes in the higher loan amount categories, particularly from 180k
onwards.
Debt to Income Column:
Borrowers with a higher DTI percentage (>43%) exhibit a significantly higher default rate
compared to those with lower DTI percentages (<36% and 37-43%). The default rate increases
as the DTI percentage increases, indicating a strong association between high DTI levels and
default propensity. However, the group with a DTI above 43% also contains more non-default
borrowers.
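The DTI observation mixes two different measures: a rate (defaults divided by all borrowers in the band) and a raw count (number of non-default borrowers). A band can lead on both at once if it simply holds more borrowers. The sketch below makes this concrete with invented counts. The band labels follow the slide, while the numbers and the name summarize are hypothetical.

```python
def summarize(groups):
    """Per-band non-default count and default rate.

    `groups` maps a band label to (defaults, non_defaults).
    """
    out = {}
    for name, (defaults, non_defaults) in groups.items():
        total = defaults + non_defaults
        out[name] = {
            "non_defaults": non_defaults,
            "default_rate": defaults / total,
        }
    return out


# Invented counts, chosen only to show that rate and count can both peak
# in the same band when that band is the largest.
counts = {"<36%": (20, 280), "37-43%": (10, 90), ">43%": (90, 510)}
summary = summarize(counts)

highest_rate = max(summary, key=lambda g: summary[g]["default_rate"])
most_non_defaults = max(summary, key=lambda g: summary[g]["non_defaults"])
print(highest_rate, most_non_defaults)
```

When comparing risk between groups of different sizes, the rate is the measure that carries over. The counts mostly reflect how many borrowers fall in each band.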
Analysis of the Categorical Value Columns
The target variable classes are almost equally distributed
among all categories of the feature variables. This is a good
sign, suggesting that every feature in the dataset is related
to the target.
Education Column:
Across all education categories, the majority of individuals have credit scores below 670. The
distribution of credit scores is relatively similar across different education levels.
Loan Purpose Column:
The loan statistics are almost the same
across the different purposes.
Marital Status Column:
Divorced individuals have a higher
proportion of defaults relative to their
non-defaults. Marital status appears to
have some correlation with loan default,
with divorced individuals at a
relatively higher risk of default than
others.
Employment Type Column:
Full-time and part-time employees have the
highest numbers of both defaulting and
non-defaulting individuals. Self-employed individuals
have the lowest default count, indicating
slightly better financial stability. Unemployed
individuals have the highest default count,
which is expected given the lack of regular
income.
Mortgage Status Column:
The default rate is slightly higher
among individuals with mortgages
than among those without.
Dependents Status Column:
The group with no dependents has a slightly
higher count of defaults than the group with
dependents. Having dependents seems to
slightly decrease the likelihood of defaulting,
as the group with dependents has a lower
proportion of defaulters than the
group without dependents.
Cosigner Status Column:
The presence of a co-signer appears to
have a positive impact on loan repayment,
as fewer defaults are observed among
individuals with co-signers than among
those without co-signers.
Correlation Table
The strongest correlation with default status
is observed with the 'Age' of the borrower.
It is negative, indicating that younger
individuals are more likely to default.
'Income' and 'MonthsEmployed' also
show negative correlations with default,
suggesting that lower income and
shorter employment are
associated with higher default rates.
Factors such as 'HasCoSigner',
'HasDependents', and 'CreditScore' show
weaker negative correlations with
default.
'InterestRate' shows the strongest
positive correlation with default,
implying that higher interest rates are
associated with higher default rates.
Other features such as 'LoanAmount'
and 'EmploymentType' also show
positive correlations with default,
although these correlations are
weaker than those for age and income.
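The sign of a correlation coefficient carries the direction of the relationship: a negative value means the default flag tends to be 1 where the feature is low, and a positive value means it tends to be 1 where the feature is high. The sketch below computes the Pearson coefficient between a numeric feature and a 0/1 default flag. It assumes Pearson correlation is the measure behind the table, and all data values are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data: defaults concentrated among the younger borrowers.
ages = [22, 25, 31, 38, 44, 52, 58, 66]
defaulted_by_age = [1, 1, 0, 1, 0, 0, 0, 0]
r_age = pearson(ages, defaulted_by_age)

# Toy data: defaults concentrated among the larger loans (in thousands).
loan_amounts = [20, 40, 60, 90, 120, 150, 180, 210]
defaulted_by_amount = [0, 0, 0, 1, 0, 1, 1, 1]
r_amount = pearson(loan_amounts, defaulted_by_amount)

print(r_age < 0, r_amount > 0)
```

With a binary target, this is the point-biserial correlation. The magnitude ranks how strongly a feature relates to default, and the sign says in which direction.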
MACHINE LEARNING
Models Used:
01 Decision Tree Model
02 Logistic Regression Model
03 Random Forest Model
04 XGBoost Model
05 Naive Bayes Model
Per-model slides: Decision Tree Model, Logistic Regression Model,
Random Forest Model, Naive Bayes Model, XGBoost Model
Conclusion:
The Logistic Regression, Random Forest,
XGBoost and Naive Bayes models have
almost similar accuracy rates. The
difference of only a few points makes
Random Forest the best model.
We can further use it for deployment.
The SVM model was also tried, but it
was the only model that took more than 10
minutes to produce its accuracy rate, so
it was dropped.
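Because the target classes are imbalanced, accuracy rates that sit close together can hide large differences in how many defaulters a model actually catches. The sketch below computes accuracy, precision and recall from predictions in plain Python. It assumes that recall on the default class is of interest alongside accuracy. The labels and predictions are invented and do not represent any of the models above.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision and recall for the positive class (1 = default)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when nothing is predicted
        # positive or when there are no true positives in the data.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Toy ground truth: 22 non-defaulters followed by 3 defaulters.
y_true = [0] * 22 + [1] * 3

# A "model" that always predicts non-default.
always_non_default = [0] * 25
baseline = binary_metrics(y_true, always_non_default)

# A hypothetical model: one false alarm, two of three defaulters caught.
candidate_pred = [0] * 21 + [1] + [1, 1, 0]
candidate = binary_metrics(y_true, candidate_pred)

print(baseline)
print(candidate)
```

In this toy case the two accuracy values are only a few points apart, yet the baseline catches no defaulters at all. Reporting recall next to accuracy makes that difference visible when choosing a model for deployment.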
OF THE RANDOM FOREST MODEL
INPUT
OUTPUT
THANK YOU

Predicting Loan Approval: A Data Science Project
