Outline
                            Motivation
              Dimensionality Reduction
                     Proposed approach
                     Experimental setup
           Conclusions and Future Work




        Bankruptcy Analysis for Credit Risk
            using Manifold Learning

B Ribeiro, A Vieira, J Duarte, C Silva, J Carvalho das Neves,
      University of Coimbra, ISEP and ISEG, Portugal
                             and
                       Q Liu, A H Sung
                   New Mexico Tech, USA


                            November, 2008

                                          ICONIP 2008
Outline
1   Motivation
2   Dimensionality Reduction
      Manifold Learning
      Isomap
      Supervised Isomap
3   Proposed approach
      Overview
      Operation
4   Experimental setup
      Data set
      Evaluation metrics
      Results
5   Conclusions and Future Work

Credit Risk Analysis



      Predicting bankruptcy has long been an important topic in
      accounting and finance, attracting considerable research from
      both academia and industry
      The question of how to determine the creditworthiness of a
      customer, or how safe it is to grant credit, remains a central
      concern for banks and investors, particularly in light of the
      recent financial crisis




Importance of Risk (1)




Importance of Risk (2)




Problem definition



      The problem of bankruptcy prediction can be stated as
      follows:
  Given a set of financial ratios describing the situation of a
  company over a given period, predict the probability that the
  company will go bankrupt in the near future, normally during
  the following year




Objectives of dimensionality reduction

      Nonlinear dimensionality reduction permits a severe reduction
      of the feature space
      A direct consequence of nonlinear dimensionality reduction is
      the possibility of visualizing the data, which can help reveal
      its structure
      Aims at choosing, from the available set of features, a smaller
      set that represents the data more efficiently
      Methods can be supervised or unsupervised
      Supervised methods use the labels of the training examples in
      the reduction step and usually perform better


Introduction



      An emerging technique that estimates a low-dimensional
      structure embedded in high-dimensional data
      The underpinning idea is to invert a generative model for a
      given set of observations
      Manifold learning can be used as a preprocessing technique
      to tackle the curse of dimensionality




Formulation



     Given data points x1, x2, . . . , xn ∈ ℝ^D, we assume that the
     data lie on a d-dimensional manifold M embedded in ℝ^D,
     where d < D
     A manifold M can be described by a single coordinate chart
     f : M → ℝ^d. Manifold learning consists of finding
     y1, . . . , yn ∈ ℝ^d, where yi = f(xi).




Isomap Algorithm

   1   Estimate which points are neighbors on the manifold M,
       based on the distances dX (i, j) between pairs of points i, j in
       the input space X, by computing the weighted neighborhood
       graph G whose edges have weight dX (i, j).
   2   Estimate the geodesic distances between all pairs of data
       points on the manifold M by computing shortest-path
       distances on the k-nearest-neighbor graph built over the
       data set.
   3   Apply classical MDS to the matrix of graph distances
       DG = {dG (i, j)}, constructing an embedding of the data in a
       d-dimensional Euclidean space Y that best preserves the
       manifold's estimated intrinsic geometry.
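The first two steps above can be sketched in plain Python. This is an illustrative, unoptimized sketch (function names are our own); a real application would use an optimized library implementation of Isomap:

```python
import math

def knn_graph(points, k):
    """Step 1: weighted k-nearest-neighbor graph with Euclidean edge weights."""
    n = len(points)
    dist = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    graph = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        graph[i][i] = 0.0
        # connect i to its k nearest neighbors (index 0 is i itself, so skip it)
        for j in sorted(range(n), key=lambda j: dist[i][j])[1:k + 1]:
            graph[i][j] = graph[j][i] = dist[i][j]
    return graph

def geodesic_distances(graph):
    """Step 2: all-pairs shortest paths (Floyd-Warshall) approximate
    the geodesic distances along the manifold."""
    n = len(graph)
    d = [row[:] for row in graph]
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][m] + d[m][j] < d[i][j]:
                    d[i][j] = d[i][m] + d[m][j]
    return d
```

On points sampled along a line, for instance, the geodesic distance between the endpoints is recovered as the sum of the short hops along the neighborhood graph.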

Analysis


      Isomap assumes that there is an isometric chart that preserves
      the distances between points.
      If xi and xj are two points on the manifold M embedded in
      ℝ^D and the geodesic distance between them is dG (xi , xj ),
      then there is a chart f : M → ℝ^d such that
      ||f (xi ) − f (xj )|| = dG (xi , xj )
      For nearby points in the high-dimensional space, the Euclidean
      distance is a good approximation of the geodesic distance,
      whereas for distant points it is not



Image Processing Example




           [J. Tenenbaum, de Silva, & Langford, 2000]
Analysis


      A weighted graph over the k nearest neighbors is built, with
      edges weighted by the Euclidean distances between nearby
      data points
      A shortest-path algorithm, such as Dijkstra's or Floyd's, then
      completes the computation of the remaining geodesic
      distances.
      MDS is then used to estimate points whose Euclidean
      distances equal the geodesic distances. Given a matrix
      D ∈ ℝ^(n×n) of dissimilarities, MDS constructs a set of points
      whose interpoint Euclidean distances closely match those in D.
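As an illustration of the MDS step, here is a minimal classical-MDS sketch in pure Python. Power iteration with deflation stands in for a proper eigensolver, and all names are our own; this is a sketch, not a production implementation:

```python
def classical_mds(D, d, iters=200):
    """Classical MDS sketch: recover d-dimensional coordinates whose
    Euclidean distances approximate the dissimilarity matrix D."""
    n = len(D)
    # double-center the squared dissimilarities: B = -1/2 * J D^2 J
    sq = [[D[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]
    tot = sum(row) / n
    B = [[-0.5 * (sq[i][j] - row[i] - row[j] + tot) for j in range(n)]
         for i in range(n)]
    coords = [[0.0] * d for _ in range(n)]
    for comp in range(d):
        v = [1.0 / (i + 1) for i in range(n)]       # arbitrary start vector
        for _ in range(iters):                      # power iteration
            w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(B[i][j] * v[j] for j in range(n)) for i in range(n))
        for i in range(n):                          # coordinate = sqrt(lambda) * v
            coords[i][comp] = v[i] * lam ** 0.5
        for i in range(n):                          # deflate B before next component
            for j in range(n):
                B[i][j] -= lam * v[i] * v[j]
    return coords
```

For three collinear points with pairwise distances 1 and 2, a one-dimensional embedding recovers those distances exactly (up to sign and translation).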


Supervised version


      The training labels are used to refine the distances between
      inputs, since both classification and visualization benefit
      when the inter-class dissimilarity is larger than the intra-class
      dissimilarity
      The mapping function given by Isomap is only implicitly
      defined, so nonlinear interpolation techniques, such as a
      GRNN, have to be used to learn it
      This can also make the algorithm overfit the training set and
      can often make the neighborhood graph of the input data
      disconnected


Determining distances


      The Euclidean distance dij = d(xi , xj ) between two given
      observations xi and xj , labeled ci and cj respectively, is
      replaced by the dissimilarity measure:

                             ((a − 1)/a)^(1/2)   if ci = cj
              D(xi , xj ) =                                         (1)
                             a^(1/2) − d0        if ci ≠ cj

  where a = e^(dij²/σ) with dij set to one of the distance measures
  described above, σ is a smoothing parameter (set according to the
  data 'density'), d0 is a constant (0 ≤ d0 ≤ 1) and ci , cj are the
  data class labels.
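Equation (1) can be written directly as a small function. A minimal sketch (function and parameter names are our own; σ and d0 would be tuned to the data): intra-class dissimilarities stay bounded by 1, while inter-class dissimilarities grow without bound.

```python
import math

def s_isomap_dissimilarity(d_ij, same_class, sigma=1.0, d0=0.5):
    """Supervised dissimilarity of Eq. (1): shrinks intra-class distances
    (bounded by 1) and inflates inter-class distances (unbounded)."""
    a = math.exp(d_ij ** 2 / sigma)          # a = e^(d_ij^2 / sigma) >= 1
    if same_class:
        return math.sqrt((a - 1.0) / a)      # = sqrt(1 - e^(-d_ij^2/sigma)) < 1
    return math.sqrt(a) - d0                 # grows with d_ij for ci != cj
```

For the same input distance, two observations from the same class are always considered closer than two from different classes, which is exactly what makes the subsequent embedding separate the classes.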


S-Isomap Semi-Supervised Approach




Testing instances


      Once the reduced space is obtained, our aim is to learn a
      kernel-based model that can be applied to test new cases
      of failed and non-failed firms
      For testing, however, Isomap does not provide an explicit
      mapping into the embedded space. Therefore we cannot
      generate the test set directly, since we would need to use the
      labels
      We use a generalized regression neural network (GRNN) to
      learn the mapping before the SVM prediction phase takes
      place
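A GRNN is essentially Nadaraya-Watson kernel regression, so the out-of-sample mapping can be sketched in a few lines. This is an illustrative sketch only (names and the default σ are our own, not the parameters used in the experiments):

```python
import math

def grnn_predict(train_x, train_y, x, sigma=1.0):
    """GRNN (Nadaraya-Watson form): predict the embedding of x as a
    Gaussian-weighted average of the training embeddings, so unseen
    points can be mapped into the reduced space without their labels."""
    weights = [math.exp(-math.dist(xi, x) ** 2 / (2 * sigma ** 2))
               for xi in train_x]
    total = sum(weights)
    dim = len(train_y[0])
    return [sum(w * yi[d] for w, yi in zip(weights, train_y)) / total
            for d in range(dim)]
```

A test point close to a training input inherits (approximately) that input's embedding, while a point between training inputs is interpolated, which is exactly the role the GRNN plays before the SVM prediction phase.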


Diane database


     Financial statements of French companies, drawn from an
     initial set of 60,000 industrial French companies with at least
     10 employees, for the years 2002 to 2006
     3,000 were declared bankrupt in 2007 or presented a
     restructuring plan
     30 financial ratios describe the firms in terms of financial
     strength, liquidity, solvency, productivity of labor and capital,
     margins, net profitability and return on investment



Financial ratios

   1. Number of employees                       2. Financial Debt/Capital Employed %
   3. Capital Employed/Fixed Assets             4. Depreciation of Tangible Assets
   5. Working capital/current assets            6. Current ratio
   7. Liquidity ratio                           8. Stock Turnover days
   9. Collection period                         10. Credit Period
   11. Turnover per Employee                    12. Interest/Turnover
   13. Debt Period days                         14. Financial Debt/Equity
   15. Financial Debt/Cashflow                   16. Cashflow/Turnover
   17. Working Capital/Turnover (days)          18. Net Current Assets/Turnover (days)
   19. Working Capital Needs/Turnover           20. Export
   21. Value added per employee                 22. Total Assets/Turnover
   23. Operating Profit Margin                   24. Net Profit Margin
   25. Added Value Margin                       26. Part of Employees
   27. Return on Capital Employed               28. Return on Total Assets
   29. EBIT Margin                              30. EBITDA Margin

Preprocessing



      Many cases with missing values, especially for defaulted
      companies
      Default cases were sorted by the number of missing values;
      examples with at most 10 missing values were considered
      600 default examples were obtained
      To balance the data set we randomly selected 600 non-default
      examples




Preprocessing

      For the ratios of the years 2003 and 2006, each missing value
      was replaced by the value of the closest available year
      For 2004 and 2005, if values for both the next and previous
      years were available, each missing value was replaced by their
      mean; otherwise it was replaced by the remaining value
      In some cases there was no data available for a ratio in any of
      the years. In these very few cases the missing data was replaced
      by the median value of the ratio in each year
      All ratios were log-transformed and then standardized to zero
      mean and unit variance
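The per-firm imputation rules above can be sketched for one ratio over the years 2003-2006. This is an illustrative sketch under our own conventions (a list indexed 0..3 for the four years, with None marking a missing value); the median fallback for fully missing ratios is omitted:

```python
def impute(values):
    """Fill missing yearly values of one ratio: edge years (2003, 2006) take
    the closest available year; middle years (2004, 2005) take the mean of
    the adjacent years when both exist, else the remaining one."""
    v = list(values)
    n = len(v)
    for i in range(n):
        if v[i] is not None:
            continue
        if i in (0, n - 1):                 # edge year: closest available year
            order = range(1, n) if i == 0 else range(n - 2, -1, -1)
            v[i] = next((v[j] for j in order if v[j] is not None), None)
        else:                               # middle year: mean of neighbors
            prev, nxt = v[i - 1], v[i + 1]
            if prev is not None and nxt is not None:
                v[i] = (prev + nxt) / 2
            else:
                v[i] = prev if prev is not None else nxt
    return v
```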


Historical data

      Companies are often subject to market fluctuations, economic
      cycles and unavoidable contingencies related to their
      business activity
      Yearly variations of important financial ratios reflected in the
      balance sheet, sometimes quite large, are common,
      particularly for small companies
      We included information from the 3 years preceding the
      default, increasing the number of inputs from 30
      to 90 ratios
      More relevant than the ratios themselves are the variations
      that occur over the period range of the analysis.

Contingency table and error measures

                                    Class Positive           Class Negative
          Assigned Positive                tp                        fp
                                    (True Positives)         (False Positives)
          Assigned Negative                fn                        tn
                                    (False Negatives)        (True Negatives)


      Recall (tp/(tp+fn)) and Precision (tp/(tp+fp))
      Error type I (fp/(fp+tn)) - % of companies classified as
      bankrupt when in reality they are healthy
      Error type II (fn/(fn+tp)) - % of companies classified as
      healthy when they are observed to be bankrupt
      Error rate ((fp+fn)/(tp+fp+fn+tn))
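The error measures above follow directly from the contingency table; a minimal sketch (function name is our own):

```python
def metrics(tp, fp, fn, tn):
    """Error measures computed from the contingency table counts."""
    return {
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "error_type_I": fp / (fp + tn),    # healthy firms flagged as bankrupt
        "error_type_II": fn / (fn + tp),   # bankrupt firms classified as healthy
        "error_rate": (fp + fn) / (tp + fp + fn + tn),
    }
```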

Trustworthiness

      A projection is trustworthy if the set of the k nearest
      neighbors of each data point in the low-dimensional space are
      also close by in the original space:

        M(k) = 1 − (2 / (Nk(2N − 3k − 1))) Σ_{i=1}^{N} Σ_{j∈Uk(i)} (r(i, j) − k),   (2)

  where r(i, j) is the rank of the data point j in the ordering
  according to the distance from i in the original data space, and
  Uk(i) denotes the set of those data points that are among the
  k nearest neighbors of the data point i in the low-dimensional
  space but not in the original space.
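Equation (2) can be computed directly from the two point sets. A minimal pure-Python sketch (illustrative; names are our own, and ranks are computed by brute-force sorting):

```python
import math

def trustworthiness(X, Y, k):
    """Trustworthiness M(k) of Eq. (2): penalizes points that enter a
    k-neighborhood in the embedding Y without being neighbors in X."""
    n = len(X)
    def neighbors(data, i):
        # indices ordered by distance from point i (excluding i itself)
        return sorted((j for j in range(n) if j != i),
                      key=lambda j: math.dist(data[i], data[j]))
    penalty = 0
    for i in range(n):
        orig_order = neighbors(X, i)
        rank = {j: r + 1 for r, j in enumerate(orig_order)}   # r(i, j)
        orig_knn = set(orig_order[:k])
        for j in neighbors(Y, i)[:k]:
            if j not in orig_knn:                             # j in U_k(i)
                penalty += rank[j] - k
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))
```

A projection identical to the original data incurs no penalty and scores exactly 1; any intruding neighbor in the embedding lowers the score.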

Visualization


  [Figure: trustworthiness of the S-Isomap projection versus the number
  of nearest neighbors k (3 to 200), for embedding dimensions nldr = 3, 5
  and 10; trustworthiness values range roughly from 0.70 to 0.95.]
S-ISOMAP with k-Nearest Neighbors in Historical
2006-2005 Data Set


    k                  KNN                                                SVM
          Test Acc     Error TypeI    Error TypeII     Test Acc           Error TypeI   Error TypeII
    3     89.20±1.35   9.05±2.60      12.56±1.60       89.55±1.01         10.31±2.30    10.62±1.98
    4     88.13±1.23   9.52±1.71      14.24±1.68       88.78±1.25         9.84±1.27     12.59±1.66
    5     88.35±2.06   10.21±1.85     12.97±2.93       88.68±1.94         10.51±1.86    12.07±2.51
    7     89.33±1.71   8.35±2.49      13.05±2.24       89.93±1.41         8.92±2.23     11.25±1.73
    10    89.30±0.89   8.86±1.74      12.50±2.18       89.90±1.61         9.01±2.10     11.13±2.52
    15    88.35±1.70   8.78±2.21      14.48±3.63       89.30±1.49         8.74±1.79     12.65±2.44
    20    87.90±0.98   8.66±2.04      15.74±2.84       88.95±1.44         9.13±1.82     13.05±2.79
    40    88.33±0.97   9.59±1.15      13.76±1.86       89.20±1.22         9.57±1.40     12.00±1.47
    60    88.75±0.93   8.02±1.89      14.52±2.38       89.13±0.68         9.02±1.55     12.77±2.17
    80    89.15±0.78   8.57±1.63      13.05±2.55       89.93±1.05         9.06±1.22     11.02±2.30
    100   89.10±1.04   8.80±2.87      12.96±1.98       89.40±1.23         9.15±2.89     12.02±1.56
    150   88.23±1.39   9.42±1.86      14.04±1.63       88.50±1.38         10.32±2.38    12.61±1.53
    200   89.13±1.71   8.29±1.11      13.12±2.77       89.33±1.85         9.36±1.05     11.99±2.99




Performance Measures on Diane Financial Data Sets

   S-Isomap    Train           Test             Recall           Precision      ErrorTypeI     ErrorTypeII
   2006        91.85 ± 0.54    87.73 ± 1.54     86.79 ±   2.62   87.94 ± 1.96   11.30 ± 1.91   13.21 ± 2.62
   2005        78.70 ± 0.91    77.08 ± 2.02     77.13 ±   2.66   76.64 ± 3.62   22.87 ± 3.37   22.87 ± 2.66
   2006-2005   94.26 ± 0.41    89.55 ± 1.01     89.38 ±   1.98   89.72 ± 1.94   10.31 ± 2.30   10.62 ± 1.98
   2005-2004   96.74 ± 0.27    79.65 ± 1.42     77.61 ±   2.71   80.61 ± 2.12   18.38 ± 2.79   22.39 ± 2.71
   KNN         Train           Test             recall           precision      errorTypeI     errorTypeII
   2006        90.92 ± 0.76    85.77 ± 1.68     77.95 ±   3.29   92.51 ± 2.00   6.32 ± 1.69    22.05 ± 3.29
   2005        84.78 ± 0.76    76.86 ± 1.71     73.22 ±   3.33   79.02 ± 1.98   19.46 ± 1.66   26.78 ± 3.33
   2006-2005   91.18 ± 1.00    86.09 ± 1.88     76.99 ±   3.87   94.22 ± 3.03   4.74 ± 2.81    23.01 ± 3.87
   2005-2004   84.39 ± 0.81    75.60 ± 1.79     64.80 ±   3.50   82.72 ± 1.65   13.58 ± 1.38   35.20 ± 3.50
   SVM         Train           Test             recall           precision      errorTypeI     errorTypeII
   2006        95.09 ± 0.42    90.54 ± 1.28     89.33 ±   2.24   91.73 ± 1.76   8.19 ± 1.90    10.67 ± 2.24
   2005        86.06 ± 0.76    81.63 ± 1.76     81.01 ±   3.81   82.42 ± 2.84   17.64 ± 2.92   18.99 ± 3.81
   2006-2005   95.85 ± 0.55    91.18 ± 1.28     92.10 ±   1.93   90.56 ± 1.69   9.74 ± 1.72    7.90 ± 1.93
   2005-2004   89.93 ± 0.66    80.29 ± 1.54     81.04 ±   2.34   79.81 ± 2.58   20.42 ± 2.53   18.96 ± 2.34
   RVM         Train           Test             recall           precision      errorTypeI     errorTypeII
   2006        97.88 ± 0.63    81.25 ± 1.78     67.35 ±   2.98   92.31 ± 1.98   5.39 ± 2.01    32.65 ± 1.45
   2005        93.25 ± 0.54    76.75 ± 1.25     72.64 ±   2.19   79.35 ± 2.34   19.09 ± 1.78   27.36 ± 2.03
   2006-2005   99.68 ± 0.35    80.71 ± 2.11     72.47 ±   6.08   89.47 ± 2.55   8.71 ± 2.56    27.53 ± 6.08
   2005-2004   100.00 ± 0.0    70.75 ± 1.74     65.36 ±   2.29   73.68 ± 1.53   23.46 ± 1.03   34.64 ± 2.29




S-Isomap with Euclidean distance - KNN and SVM
  [Figure: SVM testing accuracy with S-Isomap versus the number of
  nearest neighbors k (3 to 200), for nldr = 3, 5 and 10; accuracy ranges
  roughly from 86% to 94%.]
Discussion of Results
      S-Isomap yields better testing accuracy than plain KNN and
      RVM, by about 2% and 10% respectively
      S-Isomap yields results comparable to the SVM, yet in a much
      reduced embedded space (nldr = 3), whereas the SVM
      algorithm is used with all financial ratios
      The type II error, corresponding to a failure to correctly
      predict bankruptcy, is lower for the SVM.
      The same holds for false alarms, i.e., indicating a bankruptcy
      for a healthy firm, which correspond to the type I error.
      The fact that firms cluster nicely in the reduced space not only
      enhances financial data visualization but also improves
      prediction results as compared with the kernel machines.
Conclusions and Future Work
     We proposed an approach to bankruptcy analysis and
     prediction based on a supervised Isomap algorithm in which
     class label information is incorporated
     Assuming that corporate financial statuses lie on a manifold,
     we attempt to uncover this embedded structure using
     manifold learning
     Isomap acts as a preprocessing stage that also enables financial
     data visualization
     Results have shown that comparable testing accuracy can be
     obtained even using a 3-dimensional reduced space
     Although the results in the finance setting seem promising,
     further work is necessary to design a method for avoiding the
     interpolation error introduced in the mapping-learning stage.
Manifold learning for credit risk assessment

  • 1. Bankruptcy Analysis for Credit Risk using Manifold Learning. B Ribeiro, A Vieira, J Duarte, C Silva, J Carvalho das Neves (University of Coimbra, ISEP and ISEG, Portugal) and Q Liu, A H Sung (New Mexico Tech, USA). November 2008, ICONIP 2008.
  • 2. Outline: 1 Motivation; 2 Dimensionality Reduction (Manifold Learning, Isomap, Supervised Isomap); 3 Proposed approach (Overview, Operation); 4 Experimental setup (Data set, Evaluation metrics, Results); 5 Conclusions and Future Work.
  • 3. Credit Risk Analysis: Predicting bankruptcy has been an important topic in accounting and finance, attracting considerable research from both academia and business. How to determine the credit-worthiness of a customer, or how safe it is to grant credit, remains a central concern for banks and investors, particularly with the recent financial crisis.
  • 4. Importance of Risk (1)
  • 5. Importance of Risk (2)
  • 6. Problem definition: Given a set of financial ratios describing the situation of a company over a given period, predict the probability that the company goes bankrupt in the near future, normally during the following year.
  • 7. Objectives of dimensionality reduction: Nonlinear dimensionality reduction permits a severe reduction of the feature space. A direct consequence is the visualization of the data, which can help reveal its structure. The aim is to choose, from the available features, a smaller set that represents the data more efficiently. Methods can be supervised or unsupervised; supervised methods use the labels of the training examples in the reduction step and usually perform better.
  • 8. Manifold learning, introduction: An emerging technique that estimates a low-dimensional structure embedded in high-dimensional data. The underpinning idea is to invert a generative model for a given set of observations. Manifold learning can be used as a preprocessing technique to tackle the curse of dimensionality.
  • 9. Formulation: Given data points x1, x2, ..., xn ∈ R^D, we assume the data lies on a d-dimensional manifold M embedded in R^D, with d < D. The manifold M can be described by a single coordinate chart f : M → R^d. Manifold learning consists of finding y1, ..., yn ∈ R^d, where yi = f(xi).
  • 10. Isomap algorithm: (1) Estimate which points are neighbors on the manifold M, based on the distances dX(i, j) between pairs of points i, j in the input space X, building the weighted neighborhood graph G whose edges carry the weights dX(i, j). (2) Estimate the geodesic distances between all pairs of points on M by computing shortest-path distances on the k-nearest-neighbor graph built on the data set. (3) Apply classical MDS to the matrix of graph distances DG = {dG(i, j)}, constructing an embedding of the data in a d-dimensional Euclidean space Y that best preserves the manifold's estimated intrinsic geometry.
  • 11. Analysis: Isomap assumes there is an isometric chart that preserves distances between points. If xi and xj are two points on the manifold M embedded in R^D and the geodesic distance between them is dG(xi, xj), then there is a chart f : M → R^d such that ||f(xi) − f(xj)|| = dG(xi, xj). For nearby points in the high-dimensional space the Euclidean distance is a good approximation of the geodesic distance, whereas for distant points it is not.
  • 12. Image processing example [Tenenbaum, de Silva & Langford, 2000].
  • 13. Analysis: A weighted graph over the k nearest neighbors is built, with edges weighted by the Euclidean distances between nearby data points. A shortest-path algorithm, such as Dijkstra's or Floyd's, then completes the computation of the remaining geodesic distances. MDS is then used to estimate points whose Euclidean distances equal the geodesic distances: given a matrix D ∈ R^(n×n) of dissimilarities, MDS constructs a set of points whose interpoint Euclidean distances closely match those in D.
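The classical MDS step can be demonstrated on its own. A minimal sketch (function name is ours); when the dissimilarities really are Euclidean distances of low-dimensional points, the reconstruction preserves them exactly, up to rotation and translation:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def classical_mds(D, n_components):
    """Embed a dissimilarity matrix D so that interpoint Euclidean
    distances match the entries of D as closely as possible."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J              # Gram matrix from distances
    w, V = np.linalg.eigh(B)
    top = np.argsort(w)[::-1][:n_components]
    return V[:, top] * np.sqrt(np.maximum(w[top], 0.0))

# Distances computed from genuinely 2-D points are recovered exactly.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
D = squareform(pdist(pts))
Y = classical_mds(D, n_components=2)
assert np.allclose(squareform(pdist(Y)), D)
```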
  • 14. Outline Motivation Manifold Learning Dimensionality Reduction Isomap Proposed approach Supervised Isomap Experimental setup Conclusions and Future Work Supervised version The training labels are used to refine the distances between inputs, since both classification and visualization can benefit when the inter-class dissimilarity is larger than the intra-class dissimilarity The mapping function given by Isomap is only implicitly defined and nonlinear interpolation techniques, such as GRNN have to be used to learn it This can also make the algorithm overfit the training set and can often make the neighborhood graph of the input data disconnected ICONIP 2008
• 15. Determining distances The Euclidean distance dij = d(xi, xj) between two given observations xi and xj, labeled ci and cj respectively, is replaced by a dissimilarity measure:

    D(xi, xj) = ((a − 1)/a)^(1/2)   if ci = cj
                a^(1/2) − d0        if ci ≠ cj        (1)

where a = 1/e^(−d²ij/σ), with dij set to one of the distance measures described above, σ is a smoothing parameter (set according to the data 'density'), d0 is a constant (0 ≤ d0 ≤ 1) and ci, cj are the data class labels.
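The dissimilarity in Eq. (1) is easy to compute directly. The sketch below is ours (the function name and the default σ and d0 values are assumptions); it shows the intended behavior, namely that intra-class dissimilarities stay bounded by 1 while inter-class dissimilarities grow with distance.

```python
import numpy as np

def s_isomap_dissimilarity(dij, same_class, sigma=1.0, d0=0.5):
    """Eq. (1): shrink intra-class distances, inflate inter-class ones.
    sigma is a smoothing parameter, d0 a constant in [0, 1]."""
    a = np.exp(dij ** 2 / sigma)       # a = 1 / exp(-d_ij^2 / sigma)
    if same_class:
        return np.sqrt((a - 1) / a)    # = sqrt(1 - exp(-d^2/sigma)) < 1
    return np.sqrt(a) - d0             # unbounded, grows with distance
```

For identical same-class points the dissimilarity is 0, while two coincident points of different classes are still kept 1 − d0 apart, which is what pushes the classes away from each other in the embedding.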
• 16. Overview S-Isomap Semi-Supervised Approach
• 17. Testing instances Once the reduced space is reached, our aim is to learn a kernel-based model that can be applied to test new cases of failed and non-failed firms. For testing, however, Isomap does not provide an explicit mapping into the embedded space, so we cannot generate the test set directly, since we would need to use the labels. We use a generalized regression neural network (GRNN) to learn the mapping before the SVM prediction phase takes place.
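A GRNN is essentially Nadaraya-Watson kernel regression: the embedding of each new point is a Gaussian-weighted average of the training embeddings. The following is a minimal sketch of that idea (our own code, with an assumed bandwidth σ), not the exact network used in the experiments.

```python
import numpy as np

def grnn_predict(X_train, Y_train, X_test, sigma=0.5):
    """Map test points into the learned embedding via a GRNN
    (Nadaraya-Watson): Gaussian-weighted average of training outputs."""
    # squared Euclidean distances, shape (n_test, n_train)
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))     # Gaussian kernel weights
    W /= W.sum(axis=1, keepdims=True)      # normalise per test point
    return W @ Y_train                     # weighted average of embeddings
```

Because the prediction is a convex combination of training embeddings, a test firm close to known training firms lands near them in the reduced space, which is what the subsequent SVM relies on.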
• 18. Diane database Financial statements of French companies, initially 60,000 industrial French companies with at least 10 employees, for the years 2002 to 2006. 3,000 were declared bankrupt in 2007 or presented a restructuring plan. 30 financial ratios describe the firms in terms of financial strength, liquidity, solvability, productivity of labor and capital, margins, net profitability and return on investment.
• 19. Financial ratios
1. Number of employees
2. Financial Debt/Capital Employed %
3. Capital Employed/Fixed Assets
4. Depreciation of Tangible Assets
5. Working capital/current assets
6. Current ratio
7. Liquidity ratio
8. Stock Turnover days
9. Collection period
10. Credit Period
11. Turnover per Employee
12. Interest/Turnover
13. Debt Period days
14. Financial Debt/Equity
15. Financial Debt/Cashflow
16. Cashflow/Turnover
17. Working Capital/Turnover (days)
18. Net Current Assets/Turnover (days)
19. Working Capital Needs/Turnover
20. Export
21. Value added per employee
22. Total Assets/Turnover
23. Operating Profit Margin
24. Net Profit Margin
25. Added Value Margin
26. Part of Employees
27. Return on Capital Employed
28. Return on Total Assets
29. EBIT Margin
30. EBITDA Margin
• 20. Preprocessing Many cases have missing values, especially for default companies. Default cases were sorted by the number of missing values, and examples with at most 10 missing values were considered. 600 default examples were obtained. To balance the dataset we randomly selected 600 non-default examples.
• 21. Preprocessing For the ratios of the years 2003 and 2006, each missing value was replaced by the value of the closest available year. For 2004 and 2005, if the values of the next and previous years were available, each missing value was replaced by their mean; otherwise it was replaced by the remaining value. In a very few cases there was no data available for a ratio in any of the years; in these cases the missing data was replaced by the median value of the ratio in each year. All ratios were logarithmized and then standardized to zero mean and unit variance.
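The year-by-year rules above can be sketched for a single ratio across 2003-2006. This is a hypothetical helper of ours: NaN encodes a missing value, and the median fallback for all-missing cases is omitted for brevity.

```python
import numpy as np

def impute_ratio(v):
    """Fill missing values in v = [y2003, y2004, y2005, y2006] (NaN = missing).
    End years take the closest available year; middle years take the mean
    of the previous and next years, or whichever one is present."""
    v = np.asarray(v, float)
    yrs = np.arange(4)
    for end in (0, 3):                         # 2003 and 2006
        if np.isnan(v[end]):
            avail = yrs[~np.isnan(v)]
            if avail.size:
                v[end] = v[avail[np.argmin(np.abs(avail - end))]]
    for mid in (1, 2):                         # 2004 and 2005
        if np.isnan(v[mid]):
            nb = [x for x in (v[mid - 1], v[mid + 1]) if not np.isnan(x)]
            if nb:
                v[mid] = np.mean(nb)
    return v
```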
• 22. Historical data Companies are often subject to market fluctuations, economic cycles and unavoidable contingencies related to their business activity. Yearly variations of important financial ratios reflected in the balance sheet, sometimes quite relevant, are common, particularly for small companies. We included information from the 3 years preceding the default, so the number of inputs increases from 30 to 90 ratios. More relevant than the ratios themselves are the variations that occur over the period of the analysis.
• 23. Contingency table and error measures

                     Class Positive          Class Negative
  Assigned Positive  tp (True Positives)     fp (False Positives)
  Assigned Negative  fn (False Negatives)    tn (True Negatives)

Recall = tp/(tp+fn) and Precision = tp/(tp+fp)
Error type I = fp/(fp+tn): % of companies classified as bankrupt when in reality they are healthy
Error type II = fn/(fn+tp): % of companies classified as healthy when they are observed to be bankrupt
Error rate = (fp+fn)/(tp+fp+fn+tn)
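These measures follow directly from the four counts of the contingency table. A small helper (ours; the function and key names are assumptions) makes the definitions concrete:

```python
def error_measures(tp, fp, fn, tn):
    """Error measures from the contingency table counts."""
    return {
        'recall': tp / (tp + fn),
        'precision': tp / (tp + fp),
        'error_I': fp / (fp + tn),    # healthy firms flagged as bankrupt
        'error_II': fn / (fn + tp),   # bankrupt firms flagged as healthy
        'error_rate': (fp + fn) / (tp + fp + fn + tn),
    }
```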
• 24. Trustworthiness A projection is trustworthy if the set of the k nearest neighbors of each data point in the low-dimensional space are also close by in the original space:

    M(k) = 1 − [2 / (Nk(2N − 3k − 1))] Σ_{i=1}^{N} Σ_{j ∈ Uk(i)} (r(i, j) − k),        (2)

where r(i, j) is the rank of the data point j in the ordering according to the distance from i in the original data space, and Uk(i) denotes the set of those data points that are among the k nearest neighbors of the data point i in the low-dimensional space but not in the original space.
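Eq. (2) can be implemented directly. The sketch below is our own code, not from the paper; by construction it returns 1.0 when the embedding preserves every k-neighborhood, for instance when Y equals X.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def trustworthiness(X, Y, k):
    """Eq. (2): penalise points that are k-neighbours in the embedding Y
    but not in the original space X."""
    n = X.shape[0]
    DX, DY = squareform(pdist(X)), squareform(pdist(Y))
    # rank[i, j] = rank of j by distance from i in the original space
    # (self has rank 0, nearest neighbour rank 1)
    rank = np.argsort(np.argsort(DX, axis=1), axis=1)
    penalty = 0.0
    for i in range(n):
        nn_X = set(np.argsort(DX[i])[1:k + 1])
        nn_Y = set(np.argsort(DY[i])[1:k + 1])
        for j in nn_Y - nn_X:              # U_k(i): neighbours only in Y
            penalty += rank[i, j] - k
    return 1.0 - 2.0 / (n * k * (2 * n - 3 * k - 1)) * penalty
```

The normalization factor scales the worst possible penalty to 1, so M(k) lies in [0, 1] for k < N/2.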
• 25. Visualization [Figure: trustworthiness with S-Isomap, plotted against the number of nearest neighbors K (3 to 200) for embedding dimensions nldr = 3, 5 and 10; trustworthiness ranges from about 0.70 to 0.95]
• 26. S-Isomap with k-Nearest Neighbors on the Historical 2006-2005 Data Set

        KNN                                          SVM
  k     Test Acc     Error Type I  Error Type II     Test Acc     Error Type I  Error Type II
  3     89.20±1.35    9.05±2.60    12.56±1.60        89.55±1.01   10.31±2.30    10.62±1.98
  4     88.13±1.23    9.52±1.71    14.24±1.68        88.78±1.25    9.84±1.27    12.59±1.66
  5     88.35±2.06   10.21±1.85    12.97±2.93        88.68±1.94   10.51±1.86    12.07±2.51
  7     89.33±1.71    8.35±2.49    13.05±2.24        89.93±1.41    8.92±2.23    11.25±1.73
  10    89.30±0.89    8.86±1.74    12.50±2.18        89.90±1.61    9.01±2.10    11.13±2.52
  15    88.35±1.70    8.78±2.21    14.48±3.63        89.30±1.49    8.74±1.79    12.65±2.44
  20    87.90±0.98    8.66±2.04    15.74±2.84        88.95±1.44    9.13±1.82    13.05±2.79
  40    88.33±0.97    9.59±1.15    13.76±1.86        89.20±1.22    9.57±1.40    12.00±1.47
  60    88.75±0.93    8.02±1.89    14.52±2.38        89.13±0.68    9.02±1.55    12.77±2.17
  80    89.15±0.78    8.57±1.63    13.05±2.55        89.93±1.05    9.06±1.22    11.02±2.30
  100   89.10±1.04    8.80±2.87    12.96±1.98        89.40±1.23    9.15±2.89    12.02±1.56
  150   88.23±1.39    9.42±1.86    14.04±1.63        88.50±1.38   10.32±2.38    12.61±1.53
  200   89.13±1.71    8.29±1.11    13.12±2.77        89.33±1.85    9.36±1.05    11.99±2.99
• 27. Performance Measures on the Diane Financial Data Sets

S-Isomap
  Dataset     Train         Test          Recall        Precision     Error Type I  Error Type II
  2006        91.85±0.54    87.73±1.54    86.79±2.62    87.94±1.96    11.30±1.91    13.21±2.62
  2005        78.70±0.91    77.08±2.02    77.13±2.66    76.64±3.62    22.87±3.37    22.87±2.66
  2006-2005   94.26±0.41    89.55±1.01    89.38±1.98    89.72±1.94    10.31±2.30    10.62±1.98
  2005-2004   96.74±0.27    79.65±1.42    77.61±2.71    80.61±2.12    18.38±2.79    22.39±2.71

KNN
  Dataset     Train         Test          Recall        Precision     Error Type I  Error Type II
  2006        90.92±0.76    85.77±1.68    77.95±3.29    92.51±2.00     6.32±1.69    22.05±3.29
  2005        84.78±0.76    76.86±1.71    73.22±3.33    79.02±1.98    19.46±1.66    26.78±3.33
  2006-2005   91.18±1.00    86.09±1.88    76.99±3.87    94.22±3.03     4.74±2.81    23.01±3.87
  2005-2004   84.39±0.81    75.60±1.79    64.80±3.50    82.72±1.65    13.58±1.38    35.20±3.50

SVM
  Dataset     Train         Test          Recall        Precision     Error Type I  Error Type II
  2006        95.09±0.42    90.54±1.28    89.33±2.24    91.73±1.76     8.19±1.90    10.67±2.24
  2005        86.06±0.76    81.63±1.76    81.01±3.81    82.42±2.84    17.64±2.92    18.99±3.81
  2006-2005   95.85±0.55    91.18±1.28    92.10±1.93    90.56±1.69     9.74±1.72     7.90±1.93
  2005-2004   89.93±0.66    80.29±1.54    81.04±2.34    79.81±2.58    20.42±2.53    18.96±2.34

RVM
  Dataset     Train         Test          Recall        Precision     Error Type I  Error Type II
  2006        97.88±0.63    81.25±1.78    67.35±2.98    92.31±1.98     5.39±2.01    32.65±1.45
  2005        93.25±0.54    76.75±1.25    72.64±2.19    79.35±2.34    19.09±1.78    27.36±2.03
  2006-2005   99.68±0.35    80.71±2.11    72.47±6.08    89.47±2.55     8.71±2.56    27.53±6.08
  2005-2004   100.00±0.0    70.75±1.74    65.36±2.29    73.68±1.53    23.46±1.03    34.64±2.29
• 28. S-Isomap with Euclidean distance, KNN and SVM [Figure: SVM testing accuracy with S-Isomap (86% to 94%), plotted against the number of nearest neighbors K (3 to 200) for nldr = 3, 5 and 10]
• 29. Discussion of Results S-Isomap gives better testing accuracy than plain KNN and RVM, by 2% and 10% respectively. S-Isomap presents results comparable with SVM, however, in a much reduced embedded space (nldr = 3), whereas the SVM algorithm is run on all financial ratios. The type II error, corresponding to a failure to correctly predict bankruptcy, is lower for the SVM; the same happens with false alarms, i.e., indicating bankruptcy for a healthy firm, which correspond to the type I error. The fact that firms cluster nicely in the reduced space not only enhances financial data visualization but also improves prediction results as compared with the kernel machines.
• 30. Conclusions and Future Work We proposed an approach to bankruptcy analysis and prediction based on a supervised Isomap algorithm that incorporates class label information. Assuming that corporate financial statuses lie on a manifold, we attempt to uncover this embedded structure using manifold learning. Isomap acts as a preprocessing stage enabling financial data visualization. Results have shown that comparable testing accuracy can be obtained even in a 3-dimensional reduced space. Although the results in the finance setting seem promising, further work is needed to design a method that avoids the interpolation error introduced by the mapping learning stage.
• 31. Bankruptcy Analysis for Credit Risk using Manifold Learning. B Ribeiro, A Vieira, J Duarte, C Silva, J Carvalho das Neves (University of Coimbra, ISEP and ISEG, Portugal) and Q Liu, A H Sung (New Mexico Tech, USA). November 2008, ICONIP 2008.