2. AGENDA
PROJECT GOAL
A brief description of what we aim to achieve in this project
OVERVIEW OF DATA
Details of the dataset, including source, size, and types of variables
DATA QUALITY ISSUES
Explanation of the data quality issues we faced, including missing data, outliers,
and the presence of insignificant variables
DATA CLEANING
Steps taken to correct each of the above issues and produce a clean
dataset for further analysis
DATA INSIGHTS
The findings of the analysis, presented through simple and effective plots
3. DATA SOURCE
● Lending Club is a US peer-to-peer lending company
● It connects borrowers with investors through an online marketplace.
Source: https://www.lendingclub.com/info/download-data.action
4. PROJECT GOAL
● Explore the variables that describe loan and customer attributes
● Identify factors that are important for predicting customer default
● Visualize findings in a simple and effective manner
6. DATA DESCRIPTION
● 421,095 observations
● 143 variables
● 225 MB
Numeric: 110
Examples:
● Self-reported annual income
● FICO range (high and low)
● Loan amount
● Interest rate of loan
Categorical: 33
Examples:
● Loan description provided by the borrower
● Job title
● Home ownership
● Loan review status (approved/not approved)
9. MISSING DATA
Variables with more than 50% missing data were removed
● 64 variables
Variable | Percent Missing (%)
Loan description by the borrower | 100
Months since the last public record | 82
Months since most recent 90-day or worse rating | 71
The combined self-reported annual income provided by the co-borrowers | 100
Co-borrowers' joint income was verified by LC, not verified, or if the income source was verified | 100
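The removal rule on this slide (drop any variable with more than 50% missing values) could be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used in the project; the toy rows and the column names `loan_amnt`, `desc` and `mths_since_last_record` are assumptions made for the example.

```python
# Hypothetical toy rows; None marks a missing value.
rows = [
    {"loan_amnt": 10000, "desc": None, "mths_since_last_record": None},
    {"loan_amnt": 5000, "desc": None, "mths_since_last_record": 24},
    {"loan_amnt": 12000, "desc": None, "mths_since_last_record": None},
    {"loan_amnt": None, "desc": None, "mths_since_last_record": None},
]


def missing_share(rows, column):
    """Fraction of rows in which `column` is missing (None)."""
    return sum(row[column] is None for row in rows) / len(rows)


def drop_sparse_columns(rows, threshold=0.5):
    """Drop every column whose missing share is strictly above `threshold`."""
    keep = [c for c in rows[0] if missing_share(rows, c) <= threshold]
    return [{c: row[c] for c in keep} for row in rows]


cleaned = drop_sparse_columns(rows)
```

In this toy data only `loan_amnt` (25% missing) survives; the other two columns exceed the 50% threshold and are dropped.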
10. MISSING DATA
Variables with less than 50% missing data were imputed with the median (low
impact) or the min/max (to penalize high/low values)
● 18 columns in total
○ 10 imputed
○ 8 had the rows with missing values removed
Variable | Percent Missing (%) | Imputed By
DTI | 0.04 | Median
Months since last delinquency | 48 | Maximum
Number of Revolving Accounts | 0.2 | Median
Number of Current Delinquent Accounts | 4 | Minimum
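The imputation rules on this slide (median, minimum, or maximum of the observed values) might look like the sketch below. It is an illustration under assumptions, not the project's code: the helper `impute` and the sample values are hypothetical, and only the strategy assigned to each variable comes from the table.

```python
from statistics import median


def impute(values, strategy):
    """Fill None entries with the median, min, or max of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"median": median, "min": min, "max": max}[strategy](observed)
    return [fill if v is None else v for v in values]


# Hypothetical sample values; the strategies follow the table above.
dti = impute([12.5, None, 20.0, 17.5], "median")
mths_since_delinq = impute([3, None, 40, 12], "max")
delinq_accts = impute([0, 2, None, 1], "min")
```

Here the missing DTI becomes 17.5 (the median), the missing months since last delinquency becomes 40 (the maximum), and the missing delinquent-account count becomes 0 (the minimum).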
11. INVALID DATA
● The variable total_rec_late_fee had invalid values
● 13 negative values were identified
● These values were replaced with 0, the mode of this variable
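A minimal sketch of this fix, assuming the mode is computed over the valid (non-negative) entries only; the helper name and the sample late-fee values are hypothetical.

```python
from statistics import mode


def replace_negatives_with_mode(values):
    """Replace negative entries with the mode of the non-negative entries."""
    valid = [v for v in values if v >= 0]
    fill = mode(valid)
    return [fill if v < 0 else v for v in values]


# Hypothetical late-fee values; two of them are invalid (negative).
late_fees = [0.0, 0.0, 15.0, -2.5, 0.0, 30.0, -0.01]
fixed = replace_negatives_with_mode(late_fees)
```

Because most loans carry no late fee, the mode in this toy sample is 0, so both negative entries become 0.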
12. OUTLIERS
• Outliers were replaced with the median value.
• The graphs above illustrate the outlier treatment for the variable “Total revolving
high credit/credit limit”.
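The slide does not say how outliers were detected. The sketch below assumes the common 1.5 × IQR rule purely for illustration and then replaces each flagged value with the median, as described above; the function name and the sample credit limits are hypothetical.

```python
from statistics import median, quantiles


def replace_outliers_with_median(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the median.

    The IQR rule is an assumption; the source only states that outliers
    were replaced with the median.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    mid = median(values)
    return [mid if (v < low or v > high) else v for v in values]


# Hypothetical credit limits with one extreme value.
limits = [5000, 7000, 8000, 9000, 10000, 12000, 15000, 900000]
treated = replace_outliers_with_median(limits)
```

Only the extreme last value is flagged and replaced with the sample median; the remaining values are unchanged.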
14. TOO MANY CATEGORIES/ONE CATEGORY
• The publicly available policy code had only one category
• Zip codes included only the first three digits and did not add any value
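A single-category variable such as the policy code can be found mechanically by counting distinct values per column. The sketch below is illustrative only; the column names and rows are hypothetical, and the decision to drop the truncated zip codes is a judgment call that this check does not make.

```python
def single_valued_columns(rows):
    """Return the names of columns that take exactly one distinct value."""
    return [c for c in rows[0] if len({row[c] for row in rows}) == 1]


# Hypothetical rows: policy_code never varies, zip_code is truncated.
rows = [
    {"policy_code": 1, "zip_code": "941xx", "grade": "A"},
    {"policy_code": 1, "zip_code": "100xx", "grade": "B"},
    {"policy_code": 1, "zip_code": "606xx", "grade": "A"},
]
constant = single_valued_columns(rows)
```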
16. LOANS ISSUED BY REGION
● July and October had the highest amount of loans issued
● The Southeast and Northeast regions had the highest amount of loans issued
● The Southwest region had the lowest amount of loans issued
18. CORRELATION OF IMPORTANT VARIABLES
FICO score, total payment, and total
recovery on principal amount are good
indicators of the probability of default
A logistic regression classification
model could be built using the above
customer characteristics.
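As a rough illustration of the suggested model, the sketch below fits a logistic regression by batch gradient descent using only the standard library. No such model was built in this project; the features (standardized FICO score and total payment), the labels, and the hyperparameters are all hypothetical, and in practice an established library implementation would normally be used.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, x):
    """Predicted probability of default; w[0] is the bias term."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights (bias first) by batch gradient descent on the log loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi, start=1):
                grad[j] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


# Hypothetical standardized features: [FICO score, total payment]; 1 = default.
X = [[-1.2, -0.8], [-0.9, -1.1], [-0.3, -0.5], [0.2, 0.4], [0.8, 1.0], [1.3, 0.9]]
y = [1, 1, 1, 0, 0, 0]
w = fit_logistic(X, y)
```

On this toy data the fitted FICO weight is negative, so a lower score raises the predicted probability of default.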
19. GOOD INDICATORS OF DEFAULT
● In general, the default percentage is
directly proportional to the interest rate
● In general, the default percentage is
inversely proportional to the last
payment amount
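A relationship like this can be checked by grouping loans into interest-rate buckets and computing the default percentage in each. The sketch below assumes 5-point buckets and uses made-up (rate, defaulted) pairs; it does not reproduce the project's figures.

```python
from collections import defaultdict


def default_rate_by_bucket(loans, width=5.0):
    """Return the default percentage per interest-rate bucket.

    Buckets are `width` percentage points wide and keyed by their lower edge.
    """
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for rate, defaulted in loans:
        bucket = int(rate // width) * width
        totals[bucket] += 1
        defaults[bucket] += defaulted
    return {b: 100.0 * defaults[b] / totals[b] for b in sorted(totals)}


# Hypothetical (interest rate in %, defaulted flag) pairs.
loans = [(6.5, 0), (7.9, 0), (8.2, 0), (9.9, 1),
         (11.0, 0), (12.5, 1), (13.1, 0), (14.8, 1),
         (16.0, 1), (17.3, 1), (18.9, 0), (19.5, 1)]
rates = default_rate_by_bucket(loans)
```

In this toy sample the default percentage rises from 25% in the 5-10% bucket to 75% in the 15-20% bucket.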
20. INCOME CHARACTERISTICS
● Borrowers with high income
(i.e. >$200,000) took out larger loan
amounts than borrowers with low
and medium incomes
● Borrowers with low income (i.e.
<$100,000) generally had a lower
median employment length (5 years)
than borrowers with medium and high
incomes (7 years).
21. INCOME CHARACTERISTICS
● Borrowers have similar FICO score
distributions irrespective of income,
which suggests the score is fair across income groups
● Borrowers with low income
have the highest interest rates, followed
by the medium and high income groups
23. DATA CHALLENGES/LIMITATIONS
● A lot of data cleaning and processing was required to create an analysis-ready dataset.
● Limited period of data (12 months) for analysis.
● Limited knowledge of the lending domain.
● Few strongly correlated variables.
Editor's Notes
California, Texas, New York and Florida are the states with the highest amount of loans issued.
California, Texas and New York (but not Florida) all have above-average annual incomes, which may be why most loans are issued in these states.