Apanps5210 - final presentation

Aditi Wadhawan | Chinmayee Mohapatra | Malvika Elango
Manasa Damera | Vatsal Randhar
AN EXPLORATORY ANALYSIS

AGENDA
1 PROJECT GOAL
A brief description of what we aim to achieve in this project
2 OVERVIEW OF DATA
Details of the dataset, including source, size and types of variables
3 DATA QUALITY ISSUES
The data quality issues we faced, including missing data, outliers and insignificant variables
4 DATA CLEANING
The steps taken to correct each of the above issues and produce a clean dataset for further analysis
5 DATA INSIGHTS
The findings of the analysis, presented through simple and effective plots

DATA SOURCE
● Lending Club is a US peer-to-peer lending company
● It connects borrowers with investors through an online marketplace
Source: https://www.lendingclub.com/info/download-data.action

PROJECT GOAL
1 Explore the variables that describe loan and customer attributes
2 Identify factors that are important for predicting customer default
3 Visualize findings in a simple and effective manner

DATA DESCRIPTION
421,095 observations
143 variables
225 MB
Numeric: 110
Examples:
● Self-reported annual income
● FICO range - high and low
● Loan amount
● Interest rate of loan
Categorical: 33
Examples:
● Loan description provided by the borrower
● Job title
● Home ownership
● Loan review status - approved/not approved

DATA QUALITY ISSUES & DATA CLEANING

DATA QUALITY ISSUES
Missing Data
Outliers
Invalid Data
Too Many Categories / One Category

MISSING DATA
Variables with more than 50% missing data were removed
● 64 variables

Variable | Percent Missing (%)
Loan description by the borrower | 100
Months since the last public record | 82
Months since most recent 90-day or worse rating | 71
The combined self-reported annual income provided by the co-borrowers | 100
Co-borrowers' joint income was verified by LC, not verified, or if the income source was verified | 100
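
A minimal sketch of the 50% rule described on this slide, assuming the data is loaded into a pandas DataFrame. The toy frame and its column names are hypothetical stand-ins, not the actual Lending Club columns used in the project.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Return a copy of df without columns whose missing share exceeds threshold."""
    missing_share = df.isna().mean()  # fraction of missing values per column
    keep = missing_share[missing_share <= threshold].index
    return df[keep].copy()


# Toy data (hypothetical column names) standing in for the loan dataset
toy = pd.DataFrame({
    "loan_amnt": [1000.0, 2000.0, 3000.0, 4000.0],
    "desc": [np.nan, np.nan, np.nan, np.nan],                    # 100% missing
    "mths_since_last_record": [np.nan, np.nan, np.nan, 12.0],    # 75% missing
    "dti": [10.5, np.nan, 18.2, 7.9],                            # 25% missing
})

cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))
```

Columns at or below the threshold are kept here so that they can be imputed in a later step.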

MISSING DATA
Variables with less than 50% missing data were imputed with the median (low impact) or the min/max (to penalize high/low values)
● 18 total columns
○ 10 imputed
○ 8 had the rows with missing values removed

Variable | Percent Missing (%) | Imputed By
DTI | 0.04 | Median
Months since last delinquency | 48 | Maximum
Number of Revolving Accounts | 0.2 | Median
Number of Current Delinquent Accounts | 4 | Minimum
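
The per-column imputation choices in the table on this slide could be expressed roughly as follows. This is a sketch only: the column names and the toy values are assumptions made for illustration, and the mapping covers just three of the variables listed.

```python
import numpy as np
import pandas as pd

# Hypothetical column names mapped to the strategy shown in the table
IMPUTE_STRATEGY = {
    "dti": "median",
    "mths_since_last_delinq": "max",
    "acc_now_delinq": "min",
}


def impute(df: pd.DataFrame, strategy: dict) -> pd.DataFrame:
    """Fill missing values column by column using median, min or max."""
    out = df.copy()
    for col, how in strategy.items():
        if how == "median":
            fill = out[col].median()
        elif how == "max":
            fill = out[col].max()
        elif how == "min":
            fill = out[col].min()
        else:
            raise ValueError(f"unknown strategy: {how}")
        out[col] = out[col].fillna(fill)
    return out


toy = pd.DataFrame({
    "dti": [10.0, np.nan, 20.0],
    "mths_since_last_delinq": [5.0, np.nan, 40.0],
    "acc_now_delinq": [0.0, np.nan, 1.0],
})

imputed = impute(toy, IMPUTE_STRATEGY)
print(imputed)
```

For the columns where rows were removed instead, `DataFrame.dropna(subset=[...])` would be the corresponding pandas call.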

INVALID DATA
● The variable total_rec_late_fee had invalid values
● 13 negative values were identified
● These values were replaced with 0, the mode of this variable
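
As an illustration of this step, the sketch below flags negative late fees and replaces them with the mode of the valid values. The toy values are invented; only the column name total_rec_late_fee comes from the slide.

```python
import pandas as pd

# Toy values; a late fee received should never be negative
fees = pd.DataFrame({"total_rec_late_fee": [0.0, 15.0, -3.2, 0.0, -0.01]})

negative = fees["total_rec_late_fee"] < 0
n_invalid = int(negative.sum())  # number of invalid entries found

# Mode of the valid (non-negative) values, used as the replacement
valid_mode = fees.loc[~negative, "total_rec_late_fee"].mode().iloc[0]
fees.loc[negative, "total_rec_late_fee"] = valid_mode

print(n_invalid, valid_mode)
```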

OUTLIERS
• Outliers were replaced with the median value.
• Above graphs illustrate outlier treatment for the variable “Total revolving
high credit/credit limit”.
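
The slides do not state how outliers were identified. The sketch below assumes the common 1.5 × IQR rule purely for illustration and then replaces the flagged values with the median, as described above; the function name and toy values are hypothetical.

```python
import pandas as pd


def replace_outliers_with_median(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the series median.

    The IQR rule is an assumption; the original analysis may have used
    a different outlier definition.
    """
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    is_outlier = (s < q1 - k * iqr) | (s > q3 + k * iqr)
    return s.mask(is_outlier, s.median())


# Toy stand-in for a credit-limit style variable with one extreme value
limits = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 500.0])
treated = replace_outliers_with_median(limits)
print(treated.tolist())
```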

OUTLIERS
Variable: the upper boundary of the range that the borrower's last pulled FICO score belongs to.
Plot shown after replacing outliers with the median value.

TOO MANY CATEGORIES/ONE CATEGORY
• Publicly available policy code had only 1 category
• Zip codes included only the first three digits and did not add any value
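
One way to spot both problems is to count the distinct values per column. The sketch below uses a toy frame with hypothetical values; dropping single-valued columns automatically is an illustrative choice, while high-cardinality columns such as zip codes would still need a manual decision.

```python
import pandas as pd


def drop_constant_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Drop columns that contain a single distinct value."""
    n_unique = df.nunique(dropna=False)
    return df.loc[:, n_unique > 1]


toy = pd.DataFrame({
    "policy_code": [1, 1, 1],                  # only one category
    "zip_code": ["940xx", "100xx", "606xx"],   # truncated, many categories
    "loan_amnt": [5000, 12000, 8000],
})

cardinality = toy.nunique(dropna=False)  # distinct values per column
reduced = drop_constant_columns(toy)
print(cardinality.to_dict(), list(reduced.columns))
```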

LOANS ISSUED BY REGION
● July and October had the highest amount of loans issued
● The Southeast and Northeast regions had the highest amount of loans issued
● The Southwest region had the lowest amount of loans issued

GEOGRAPHIC DISTRIBUTION OF LENDING CLUB ISSUED LOANS
States labeled on the map: California, Texas, Florida, New York

CORRELATION OF IMPORTANT VARIABLES
FICO score, total payment and total recovery on principal amount are good
indicators of the probability of default
A logistic regression classification
model could be built using the above
customer characteristics.
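
To make the suggestion concrete, here is a minimal logistic regression sketch fitted by plain gradient descent with NumPy. Everything in it is an assumption for illustration: the data is synthetic, a single standardized feature stands in for FICO score, and in practice a library implementation would normally be used instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: one standardized feature playing the role of FICO score
n = 400
fico_z = rng.normal(size=n)
p_default = 1.0 / (1.0 + np.exp(3.0 * fico_z))  # lower score, higher default risk
y = (rng.random(n) < p_default).astype(float)   # 1 = default, 0 = no default

X = np.column_stack([fico_z, np.ones(n)])       # feature plus intercept column
w = np.zeros(2)                                 # [slope, intercept]

for _ in range(500):
    pred = 1.0 / (1.0 + np.exp(-X @ w))         # predicted default probability
    grad = X.T @ (pred - y) / n                 # gradient of the mean log-loss
    w -= 0.5 * grad                             # gradient descent step

pred = 1.0 / (1.0 + np.exp(-X @ w))
accuracy = float(((pred >= 0.5) == (y == 1.0)).mean())
print(w, accuracy)
```

A negative fitted slope means that a higher score lowers the predicted probability of default, which is the direction of the relationship built into the synthetic data.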

GOOD INDICATORS OF DEFAULT
● In general, the default percentage rises as the interest rate rises
● In general, the default percentage falls as the last payment amount rises

INCOME CHARACTERISTICS
● Borrowers with high income (i.e. >$200,000) took out larger loans than
borrowers with low and medium incomes
● Borrowers with low income (i.e. <$100,000) generally had fewer median
years of employment (5 years) than borrowers with medium and high
incomes (7 years)
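
The income groups can be sketched with `pandas.cut`. The slide gives only the low (<$100,000) and high (>$200,000) cut-offs, so treating everything in between as medium, the handling of values exactly on a boundary, the column names and the toy values are all assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "annual_inc": [45_000, 80_000, 120_000, 150_000, 250_000, 400_000],
    "loan_amnt": [5_000, 8_000, 12_000, 15_000, 30_000, 35_000],
    "emp_length_years": [3, 5, 7, 8, 7, 10],
})

# Bins are right-inclusive by default, so exactly 100,000 falls in "low"
bins = [0, 100_000, 200_000, float("inf")]
toy["income_group"] = pd.cut(
    toy["annual_inc"], bins=bins, labels=["low", "medium", "high"]
)

# Median loan amount and employment length per income group
summary = toy.groupby("income_group", observed=True)[
    ["loan_amnt", "emp_length_years"]
].median()
print(summary)
```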

INCOME CHARACTERISTICS
● Borrowers have similar FICO score distributions irrespective of
income, implying fairness of the score
● Loan borrowers having low income
have the highest interest rate, followed
by medium and high income groups

INCOME AND AVERAGE INTEREST ON LOAN PURPOSE

DATA CHALLENGES/LIMITATIONS
1 A lot of data cleaning and processing was required to create an analysis-ready dataset.
2 The data covers a limited period (12 months).
3 Limited knowledge of the lending domain.
4 Few strongly correlated variables.

REFERENCES
Peer to Peer Lending & Alternative Investing. (n.d.). Retrieved from https://www.lendingclub.com/
Bachmann, J. A. (n.d.). Lending Club || Risk Analysis and Metrics. Retrieved from https://www.kaggle.com/janiobachmann/lending-club-risk-analysis-and-metrics
Sheth, A. (n.d.). Analysis and Modelling of Lending Club loan data. Retrieved from https://www.kaggle.com/adityasheth/analysis-and-modelling-of-lending-club-loan-data
