COMBINING STATISTICAL AND MACHINE
LEARNING BASED CLASSIFIERS
IN THE PREDICTION OF CORPORATE FAILURE
S. Dizdarevic, P. Larrañaga, B. Sierra, J.A. Lozano, J.M. Peña
Department of Computer Science and Artificial Intelligence
University of the Basque Country
Abstract
This project applies methods from Statistics and from Machine Learning, an area of Artificial
Intelligence, to the problem of corporate failure prediction. The paradigms compared empirically
on a sample of 120 Spanish companies, 60 of which had gone bankrupt and 60 of which had not, are
Discriminant Analysis, Logistic Regression, Classification Trees, Rule Induction and Bayesian
Networks. Two Artificial Intelligence techniques, Voting by Majority Principle and Bayesian
Formalism, are implemented in order to improve prediction over the single models compared. The
predictor variables are financial ratios that gather the accounting information taken for every
company over the three years previous to the date of survey.
1. Introduction
Corporate failure prediction, through classification of known cases and generalization to
other cases, has been a subject of study for almost 30 years. Accurate prediction of corporate
failure is important to investors, creditors and auditors. It can also help shareholders, creditors
and governments to avoid heavy losses stemming from surprise bankruptcies. Using analytic tools
and data from corporate financial reports, one can therefore evaluate and predict a company's
future financial status.
Although the idea of a company going bankrupt is associated with its disappearance,
before this really happens the company goes through a long period of crisis with several
stages. Many authors group them into two main stages, corresponding to the two senses
of the concept of failure: economic and financial. Economic failure starts when the
profitability of the invested capital falls below its opportunity cost, so that the owner receives
an investment yield lower than that of alternative opportunities with the same risk. As economic
failure advances and settles into the company, income starts to fall below expenses
and the first negative results appear.
The deterioration produced during the economic failure process, if not corrected, will
drive the company into technical insolvency. This is the first stage of what is called financial
failure. In this situation the company does not have enough liquidity to meet its payments as
these increase.
The breaking point of this ruinous process is reached when the company is not only
unable to pay its debts as they fall due but is also in a situation of negative net worth. This
means that its current liabilities are greater than the value of its assets, which could soon lead
the company to its disappearance.
The study of corporate failure must always proceed through the search for its causes,
which can be analyzed through their visible symptoms. As Argenti (1976) proposes, it is very
useful to know the reasons why other companies have failed in order to avoid them in one's own.
Nevertheless, these causes can only be captured through the discovery of their
symptoms. Some of these causes are the following: management faults, deficiencies in the
accounting information systems, inability to adapt to changes in the environment, overtrading,
the launch of big projects, abuse of debt financing, and the risks inherent in the business world.
As for the symptoms, Argenti accepts as such the deterioration suffered by the financial ratios
as the company comes closer to failure, and also indicates that accounting manipulation is one
clear symptom of failure.
According to Platt (1985), the sources that an external economic agent can use to
detect the aforementioned symptoms can be grouped into three sets: common sense,
the analysis of the statements of accounts published by the companies, and
statistical tools.
The use of common sense, although a very simple strategy, has the following advantages:
it does not need sophisticated computers and there is no need to assimilate financial
conditions that are sometimes not easily understandable. All one has to do is pay attention to the
daily reality of the company and its environment. Some signs of an approaching crisis
are: a change of auditor, development of the relation with the new value, sudden resignations of
board members, reduced or canceled credit lines, sales of stock by the directors of
the board, a market price of the company's stock below its accounting
value, excess of stock, …
The analysis of the statement of accounts is part of an information process whose aim is
to provide data for decision making. The idea of failure, and more precisely the idea of
insolvency, has remained connected to the technique of accounting ratios. It was thought that the
ratios worsened as the company approached the crisis, so that the deterioration
suffered by the company could be measured.
Due to the great complexity of the information contained in financial statement
data, the analysis of financial ratios, which gather all this information, has been the most used
technique. The great interest in comparisons between different companies (industrial sector, size,
…) has influenced its use. There are two main difficulties related to financial ratios: their
creation and their interpretation. A further difficulty is that the same
value of a ratio for two companies from different sectors may represent different situations.
The financial information gathered in ratios has to be homogenized so that it can be used for
the description and prediction of corporate failure. The second task is directly related to the use
of Statistics.
Although ignored for half a century by analysts, statistical
techniques have nowadays become a commonly used tool because they give objectivity to the
analysis. Beaver (1966) was among the pioneers who used them for the analysis of financial
ratios in order to predict corporate failure. In his work, starting from 30 ratio variables taken
from 79 pairs of companies, failed and non-failed, 6 ratio variables were selected. A profile
analysis was based upon them, comparing the means of the values of every ratio in each group,
failed and non-failed, and observing the important differences five years before the date of
failure. Beaver developed a dichotomic heuristic classification test for every ratio, using a
process of trial and error that allowed him to choose, for every ratio and every year, the
cut-point that minimized the classification errors.
Nevertheless, the univariate model of Beaver contrasts with the inherently
multivariate character of the documents describing the financial situation. Therefore, in order to
make the most of these documents, they have to be interpreted from a perspective that allows the
several financial aspects of a company to be considered as a whole. The search for this perspective
is the reason why several researchers have used multivariate statistical techniques for
corporate failure prediction.
Altman (1968) was the pioneer in applying Discriminant Analysis to the
aforementioned problem, obtaining surprising results. The linear combination of five ratios in a
profile created a score capable of discriminating between “healthy” and “failed” companies with
very high percentages of success in the two years previous to the failure. The initial work of
Altman was adapted, updated and improved by several researchers. Worth noting are the works
of Deakin (1972), Blum (1974), Edmister (1972), Libby (1975), Scott (1981), and Taffler
(1982).
The need for a statistical alternative that avoids the problems related to Discriminant
Analysis led to the use of conditional probability models, logit and probit, which are more
flexible in their requirements. Ohlson (1980) is considered the first author to publish a model for
the prediction of failure based on conditional probability models. Though his results were not
brilliant, his methodology was followed by other authors: Mensah (1983), Zavgren (1985), Casey and
Bartczak (1985), and Peel and Peel (1987).
This chapter is organized as follows. Section 2 presents the features of the case study
used to carry out the empirical comparison among several paradigms coming from Statistics
and Artificial Intelligence and the combining techniques. These paradigms are explained in
Section 3. Section 4 shows the results obtained for every method in terms of the percentage of
well-classified companies, together with descriptions of the models and an analysis of the results.
Section 5 closes with the conclusions of the work and proposals for further research.
2. Problem Description
Starting from the hypothesis that the accounting information patterns of non-failed and
failed companies are different, the fundamental aim of this chapter is to show by means of an
example how to create models able to predict the failure of companies in advance (1 year, 2 years
and 3 years). These models can be considered normative systems, as they are founded on
probability theory.
Following recent research in Artificial Intelligence, two techniques have
been implemented and used to integrate the individual models into one, in order to improve on the
predictive ability of each of them.
In this section the problem is presented, dealing with aspects such as the concept of
failure, obtaining and validating the sample, the selection of the financial ratios with which the
models are constructed, and the sample for the combined models. A more detailed description of the
failure problem can be found in Lizarraga (1996), the work that inspired this project.
The data sample used here is the same one that Lizarraga gathered from several Provincial
Trade Register Offices and used for the empirical comparison in his doctoral dissertation. The
procedure by which the data sample of 120 companies was selected and formed is as follows.
The first methodological problem to solve was to determine which concept of failure to
use. He chose the concept of suspension of payments, given that it is related
not to a specific financial problem but to a situation of profound economic crisis. This
concept presents three fundamental advantages: it is objective, it gives a representative date for
the moment of failure, and a large and growing number of companies had to resort to it in
the period of study. Finally, the availability of the annual accounts deposited in the several
Provincial Trade Register Offices was another aspect that helped Lizarraga to carry out the
empirical work of information gathering.
The sample was made up of 120 companies: half of them belonged to a group of companies
classified as “failed” and the other half were classified as “healthy”. The selection was carried
out by a matching process: starting from a list of “failed” companies previously selected, each
was matched with a “healthy” company of the same size and
industrial sector. This matching process is justified by the convenience of avoiding any possible
distortion related to size and industrial sector. As access to every Provincial Trade
Register Office was not possible, Lizarraga decided to reduce the scope to the 10 provinces with
the largest number of records of payment suspension requested during the period of the study. These
selected provinces accounted for 63% of the total number of records of payment suspension. The
information was gathered through the Official Bulletin of the Trade Register Office. The interval
of time was 18 months (from January 1993 to July 1994), and it can be regarded as
representative of a period of severe crisis among the companies in Spain. For every company in
the study, the economic and financial data corresponding to the three years previous to the end
of the study were obtained.
Though Lizarraga (1996) validated the model created using a sample of 44 companies
(22 “healthy” and 22 “failed”) different from the ones used to construct the models, in this study
the models are created using a sample of 120. A validation of the original model based on 5-fold
cross-validation (Stone 1976) has been carried out, as well as another validation
methodology which will be explained later. Estimates of the goodness of fit of every model, that
is, the probability of the model classifying correctly, are calculated in the following steps.
First, the sample is randomly ordered and then partitioned into 5 disjoint subsets. Secondly,
choosing 4 of those subsets as the training set, a model is obtained and then tested with the fifth
subset. These steps are repeated five times, each time using a different testing set, thus
obtaining 5 percentages $p_1, p_2, \dots, p_5$ of well-classified cases, as well as the “destiny”
(failed or non-failed) of each one of the 120 companies distributed among the five disjoint test
sets. The estimate of p, the probability with which the model created with the original sample
classifies correctly, is given by the following formula:
$$p = \frac{\sum_{i=1}^{5} p_i}{5}.$$
Figure 1 shows graphically the process of estimation of the goodness of fit of the model with
5-fold cross-validation.
[Figure 1: for i = 1, …, 5, training set i produces model i, which is tested on test set i to give p_i; the five values are averaged, p = (p_1 + … + p_5) / 5, to estimate the goodness of fit of the MODEL.]
Figure 1. Process of estimation of the goodness of fit of the model with 5-fold cross-validation
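The 5-fold cross-validation procedure described above can be sketched in Python as follows. This is a minimal illustration and not the software used in the study: the function names, the `fit` interface (a function that receives training data and returns a classifier) and the stand-in learner `fit_majority` are assumptions introduced here.

```python
import random


def five_fold_estimate(samples, labels, fit, k=5, seed=0):
    """Estimate p as the mean of the k per-fold percentages of well-classified cases."""
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)        # randomly order the sample
    folds = [order[i::k] for i in range(k)]   # k disjoint subsets
    percentages, destiny = [], {}
    for test in folds:
        held_out = set(test)
        train = [i for i in order if i not in held_out]
        # Build model i from the other k - 1 subsets.
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        hits = 0
        for i in test:
            destiny[i] = model(samples[i])    # "destiny" assigned to company i
            hits += destiny[i] == labels[i]
        percentages.append(100.0 * hits / len(test))
    return sum(percentages) / k, percentages, destiny


def fit_majority(train_x, train_y):
    """Hypothetical stand-in learner: always predicts the most frequent training class."""
    majority = max(set(train_y), key=train_y.count)
    return lambda x: majority
```

With 120 companies each test set holds 24 of them, and `destiny` ends up with one predicted class per company, which is the information later passed on to the combined classifiers.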
The other approach to model validation presented in this chapter is the following. Using 4
of the 5 disjoint subsets (obtained as in 5-fold cross-validation) as the training set, a model is
obtained and then tested with the whole sample of 120 companies, instead of the 24 used in the
previous approach. The same process of estimation of p, the probability with which the model
created with the original sample classifies correctly, is used as explained before. Figure 2 shows
graphically the process of estimation of the goodness of fit of the model with this validation.
[Figure 2: for i = 1, …, 5, training set i produces model i, which is tested on all 120 companies to give p_i; the five values are averaged, p = (p_1 + … + p_5) / 5, to estimate the goodness of fit of the MODEL.]
Figure 2. Process of estimation of the goodness of fit of the model with the second validation
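A minimal sketch of this second validation, under the same assumptions as any illustrative code in this chapter (the function name and the `fit` interface are hypothetical, not the authors' software): each model is built from 4 of the 5 subsets and then tested on the whole sample.

```python
import random


def second_validation_estimate(samples, labels, fit, k=5, seed=0):
    """Train on k - 1 subsets, test each model on the whole sample, average the percentages."""
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    percentages, predictions = [], []
    for left_out in folds:
        excluded = set(left_out)
        train = [i for i in order if i not in excluded]
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        preds = [model(x) for x in samples]   # test on all companies, not only the left-out ones
        predictions.append(preds)
        hits = sum(pred == y for pred, y in zip(preds, labels))
        percentages.append(100.0 * hits / len(samples))
    return sum(percentages) / k, percentages, predictions
```

Because every model also sees most of its test cases during training, the resulting estimate of p should be read as more optimistic than the cross-validated one.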
Lizarraga selected the explanatory variables from the certificate of the Trial Balance,
from the profit and loss account and from the financial chart. The selection of these financial
ratios was based on two main criteria:
i) they were frequently mentioned in the literature on the analysis of financial
statements,
ii) they could be calculated directly or with simple adjustments.
Starting from the 50 variables that Lizarraga had obtained in the previous selection, a
Principal Components Factorial Analysis was carried out, which allowed them to be reduced to the
9 ratios that were used.
These ratios were:
X1. CURRENT ASSETS / CURRENT LIABILITIES
X2. CURRENT ASSETS / TOTAL ASSETS
X3. NET RESULT / TOTAL ASSETS
X4. EARNINGS BEFORE INTEREST AND TAXES / FINANCIAL CHARGES
X5. OWN FUNDS / TOTAL DEBT
X6. SALES / OWN FUNDS
X7. STOCKS / SALES
X8. DEBTORS / SALES
X9. OPERATIVE CASH FLOW º / TOTAL ASSETS
X1 - The CURRENT ASSETS / CURRENT LIABILITIES ratio measures the firm’s ability
to meet maturing short-term obligations. It is used to measure liquidity, that is, whether the firm
has sufficient cash to meet short-term claims.
X2 - The CURRENT ASSETS / TOTAL ASSETS ratio represents the share of current
assets (money, bank, quick assets, stocks…) in total assets (current assets + fixed assets).
X3 - The NET RESULT / TOTAL ASSETS ratio is a profitability ratio which indicates the
extent to which a company earns the highest return possible for the resources used or capital
employed, consistent with the desire and ability to assume risk.
X4 - The EARNINGS BEFORE INTEREST AND TAXES / FINANCIAL CHARGES ratio
is calculated by dividing the earnings before interest and taxes by the financial charges
of a firm. In essence, it is a measure of the true productivity of the firm’s charges. Earnings
before interest and taxes are calculated as: earnings before taxes + financial charges.
X5 - The OWN FUNDS / TOTAL DEBT ratio is usually used to measure indebtedness. It is
calculated by dividing own funds by total debt. Own funds are those funds for financing that
consist of the capital that the owner deposited at the beginning of the firm’s existence, reserves,
amortization, subventions, ….
X6 - The SALES / OWN FUNDS ratio is calculated by dividing sales by own funds. Sales
represent the net income of the firm’s business.
X7 - The STOCKS / SALES ratio is used to observe the rotation of stocks. It is calculated by
dividing stocks (merchandise, raw materials, finished and semi-finished products…) by
sales. It represents stocks as a percentage of sales.
X8 - The DEBTORS / SALES ratio is used to observe the rotation of current assets. It is
calculated by dividing debtors by sales.
X9 - The OPERATIVE CASH FLOW º / TOTAL ASSETS ratio is calculated by dividing
operative cash flow by total assets. The operative cash flow is calculated with the following formula:
OCF º = traditional cash flow (net result + amortization + long-term reserves) - increment of
exploitation of current assets + decrement of exploitation of current liabilities.
The several classificatory models constructed - described in the following section - use the
aforementioned ratios.
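As an illustration of how the nine ratios listed above could be derived from a company's accounting figures, the following Python sketch computes X1 to X9 from a dictionary. The dictionary keys and the numeric values are hypothetical and chosen only for the example; they are not data from the study.

```python
def financial_ratios(acc):
    """Compute the ratios X1..X9 from a dict of accounting figures (hypothetical keys)."""
    # Earnings before interest and taxes = earnings before taxes + financial charges.
    ebit = acc["earnings_before_taxes"] + acc["financial_charges"]
    return {
        "X1": acc["current_assets"] / acc["current_liabilities"],
        "X2": acc["current_assets"] / acc["total_assets"],
        "X3": acc["net_result"] / acc["total_assets"],
        "X4": ebit / acc["financial_charges"],
        "X5": acc["own_funds"] / acc["total_debt"],
        "X6": acc["sales"] / acc["own_funds"],
        "X7": acc["stocks"] / acc["sales"],
        "X8": acc["debtors"] / acc["sales"],
        "X9": acc["operative_cash_flow"] / acc["total_assets"],
    }


# Illustrative figures only (not taken from the sample of 120 companies).
example = {
    "current_assets": 500.0, "current_liabilities": 250.0, "total_assets": 1000.0,
    "net_result": 50.0, "earnings_before_taxes": 60.0, "financial_charges": 20.0,
    "own_funds": 400.0, "total_debt": 600.0, "sales": 1200.0,
    "stocks": 150.0, "debtors": 300.0, "operative_cash_flow": 100.0,
}
ratios = financial_ratios(example)
```

A real implementation would also need to decide how to handle zero denominators, for example a company with no financial charges; the sketch leaves that case out.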
The sample used to obtain the combined classifiers is constructed as follows.
Regardless of the internal structure of an individual classifier and the methodology it is based
on, a classifier may be regarded as a function that receives an input individual X and outputs a
class j, denoted in short by e(X) = j. In the problem of corporate failure there are only two
classes, non-failed and failed, represented in our case by class 1 and class 2 respectively. As
explained before, an individual X is represented by the set of nine ratio-variable values. Even
though some classifiers may supply extra information, such as the probability with which class j
is assigned to the individual X, only the class j was considered as output information. The
decision to use this approach was based on its generality: the individual classifiers may be very
different from each other in their methodologies, and this kind of information is common to all of
them. A good summary of existing techniques for combining classification results can be found in
Xu et al. (1992) and Ho et al. (1994).
Having the result (class j) for every classifier, these results are now considered knowledge
from different sources and used as input data for the combined classifiers. As explained before,
two validation methods are used. Figure 3 shows graphically how the output data of the individual
classifiers are integrated when 5-fold cross-validation is used.
                                  PARADIGMS
individuals                DA    LR    CN2    CART    Bayes
test set 1 (1…24)
test set 2 (25…48)
test set 3 (49…72)
test set 4 (73…96)
test set 5 (97…120)
Figure 3. Process of construction of the sample for combined classifier with
5-fold cross-validation
Elements of pair (test set i , paradigm j), where i = 1 ... 5; and j = 1 … 5, represent the
classes that model i , created with training set i , using paradigm j, assigns to each one of 24
individuals of the test set i.
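The layout of Figure 3 can be assembled with a few lines of Python. This is a sketch under assumed data structures: `fold_outputs` and the function name are hypothetical, and the class labels would come from whatever software built each individual model.

```python
PARADIGMS = ["DA", "LR", "CN2", "CART", "Bayes"]


def combined_sample(fold_outputs, n_individuals):
    """Build one row per individual and one column per paradigm.

    fold_outputs[paradigm] is a list of (test_indices, predicted_classes) pairs,
    one pair per test set, as produced by 5-fold cross-validation.
    """
    rows = [[None] * len(PARADIGMS) for _ in range(n_individuals)]
    for j, paradigm in enumerate(PARADIGMS):
        for test_indices, classes in fold_outputs[paradigm]:
            for i, cls in zip(test_indices, classes):
                rows[i][j] = cls   # class assigned to individual i by paradigm j
    return rows
```

Since the five test sets are disjoint and cover the whole sample, every cell of the resulting 120 × 5 matrix is filled exactly once.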
Figure 4 shows graphically the same process but when using the second validation
explained before.
                                      PARADIGMS
individuals    DA         LR         CN2        CART       Bayes
               M1 … M5    M1 … M5    M1 … M5    M1 … M5    M1 … M5
1
2
:
:
120
Figure 4. Process of construction of the sample for combined classifier with
the second validation
The pair ( individual i, ( paradigm j, model k )) , where i = 1 ... 120; j = 1 … 5; and
k = 1 … 5, represents the class that model k (M k) created with training set k , using
paradigm j, assigns to the individual i.
The methodologies used to integrate the individual classifiers, which use the aforementioned
matrices, are described in the following section.
3. Methodologies
3.1 Methodologies - Individual Classifiers
Supervised classification is one of the tasks most frequently carried out by so-called
Intelligent Systems. A large number of paradigms developed either by Statistics (Logistic
Regression, Discriminant Analysis, K-NN) or by Artificial Intelligence (Neural Networks, Rule
Induction, Classification Trees, Bayesian Networks) are capable of carrying out classification
tasks.
The comparison among the previous paradigms cannot be carried out analytically,
so it is usual to compare them empirically (Michie et al. (1994)).
Five of the previous paradigms have been selected to make this comparison -
Discriminant Analysis, Logistic Regression, Classification Trees, Rule Induction and Bayesian
Networks.
3.1.1 DISCRIMINANT ANALYSIS
Discriminant Analysis, introduced by Fisher (1936), is a statistical technique used in
Pattern Matching. Its main use is to predict group membership from a
set of predictors. Discriminant function analysis consists of finding a transformation which gives
the maximum ratio of the difference between a pair of group multivariate means to the multivariate
variance within the two groups. Accordingly, an attempt is made to delineate the groups by
maximizing between-group variance while minimizing within-group variance. The predictor
characteristics are related to form groups based upon similarities of distribution in the
p-dimensional space, which are then compared to groups which are input by the user as truth. This
enables the user to test the validity of groups based upon actual data, to test groups which have
been created, or to put objects into groups. Two things can then be done with Discriminant
Analysis (DA): analysis and classification. Analysis is related to how the material is to be
interpreted, i.e., are there any differences between the groups, and which variables make these
differences? Classification is used to find a set of variables which provide satisfactory
discrimination, so that classification equations can be derived, their use checked through
hit-rate tables and, if good, used to classify new subjects who were not in the original
analysis. DA creates a linear discriminant function from the predictor variables
$x_1, x_2, \dots, x_p$, and the results of this function are later used as a classification
criterion. The discriminant function takes this form:
$$Z = a_1 x_1 + a_2 x_2 + \dots + a_p x_p$$
where
$x_j$ are the explanatory variables (j = 1, 2, …, p),
$a_j$ are real coefficients estimated by the model (j = 1, 2, …, p),
Z is the discriminant score.
The discriminant function is used for classifying new individuals starting from rules in
the following way:
“If $a_1 x_1^{i} + a_2 x_2^{i} + \dots + a_p x_p^{i} > c$ then individual i is classified as 0,
otherwise it is classified as 1.”
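The classification rule above amounts to computing the discriminant score of an individual and comparing it with a cut-off value c. A minimal Python sketch follows; the function names are introduced here for illustration, and the coefficients and cut-off in the usage lines are made-up values, not the ones estimated in the study.

```python
def discriminant_score(coefficients, x):
    """Z = a_1 x_1 + a_2 x_2 + ... + a_p x_p for one individual x."""
    return sum(a * value for a, value in zip(coefficients, x))


def classify_da(coefficients, cut_point, x):
    """Class 0 if the discriminant score exceeds the cut-off c, class 1 otherwise."""
    return 0 if discriminant_score(coefficients, x) > cut_point else 1


# Illustrative values only.
a = [0.5, -1.0, 2.0]
c = 0.5
label_high = classify_da(a, c, [2.0, 1.0, 0.5])   # score 1.0, above c
label_low = classify_da(a, c, [0.0, 1.0, 0.0])    # score -1.0, below c
```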
The procedure PROC DISCRIMINANT of the SPSS software has been used in this project.
The method selected for model construction was MAHAL, and variables were selected using
stepwise selection. The selection rule was: maximize the minimum Mahalanobis distance between
groups. For each model some of the p variables were selected, and their corresponding
coefficients were calculated.
3.1.2 LOGISTIC REGRESSION
Logistic Regression (Hosmer and Lemeshow (1989)) is a method coming from Statistics
whose objective is to obtain a functional relationship between a transformation of a
qualitative variable, called the logit, and p predictor variables which can be either quantitative
or qualitative.
It is used to develop a model which fits as well as possible while remaining sufficiently
reasonable to describe the relation between the result (dependent variable or response) and the
set of independent (or explanatory) variables. The fundamental characteristic of this regression is
that the dependent variable is dichotomic. Mathematically, the function used in the logistic
distribution is extremely flexible and easy to use.
Denoting the dichotomic variable to predict by Y and the p predictor variables by
$x_1, \dots, x_p$, the objective is to determine the coefficients $\beta_0, \beta_1, \dots, \beta_p$
in order to satisfy the logit transformation formula:
$$g(X) = \ln \frac{P(Y = 1)}{P(Y = 0)} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$$
The coefficients ($\beta_0$ is the intercept and $\beta_1, \dots, \beta_p$ are the p parameters)
are determined by the method of maximum likelihood. These coefficients are
interpreted in terms of odds ratios, and the variable selection can be carried out by three
methods: forward, backward or stepwise. The stepwise method is the most commonly used one. In it
the variables are included or excluded according to a statistical criterion.
After obtaining g(X), a classification model can be constructed. The logistic regression
model is described by the following formula:
$$\pi(x) = \frac{e^{g(X)}}{1 + e^{g(X)}}.$$
It is used to classify new individuals starting from rules in the following way:
“ If π(x) > c then individual i is classified as 0,
on the contrary is classified as 1”.
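The logit, the logistic model and the classification rule above can be put together in a short Python sketch. The function names and the default cut-off c = 0.5 are assumptions made for this illustration, since the text does not state the value of c, and the coefficients in the usage lines are made up.

```python
import math


def logit(betas, x):
    """g(X) = beta_0 + beta_1 x_1 + ... + beta_p x_p; betas[0] is the intercept."""
    return betas[0] + sum(b * value for b, value in zip(betas[1:], x))


def logistic(betas, x):
    """pi(x) = e^g(X) / (1 + e^g(X))."""
    g = logit(betas, x)
    return math.exp(g) / (1.0 + math.exp(g))


def classify_lr(betas, x, c=0.5):
    """Class 0 if pi(x) > c, class 1 otherwise (c = 0.5 is an assumed default)."""
    return 0 if logistic(betas, x) > c else 1


# Illustrative coefficients only.
betas = [0.0, 1.0]
probability = logistic(betas, [2.0])
label = classify_lr(betas, [2.0])
```

The direct formula overflows for very large values of g(X); a production implementation would typically use a numerically stable form of the logistic function.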
For this work the PROC LOGISTIC procedure of the SAS software has been used for the
empirical comparison.
3.1.3 RULE INDUCTION
In the task of constructing Expert Systems, systems for inducing concept descriptions
from examples have proved useful in easing the bottleneck of knowledge acquisition. One of these
induction systems, CN2 (The Turing Institute (1988)), has been used as a representative of the
approach called Machine Learning. CN2 was designed for the efficient induction of simple,
comprehensible production rules in domains where problems of poor description language and/or
noise may be present. CN2 produces an ordered list of if-then rules rather than an unordered set
of if-then rules, although it is also possible to produce the latter by changing the evaluation
function. The rules induced by CN2 are of the form:
"if complex then predict class"
where complex is a conjunction of attribute tests.
In order to classify new examples, CN2 applies an interpretation in
which each rule is tried in order until one is found whose conditions are satisfied by the example
to be classified. The class predicted by this rule is then assigned as the
class of that example. If no induced rule is satisfied, the final default rule assigns the most
commonly occurring class in the training data to all new examples.
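The way an ordered rule list is applied to a new example can be sketched as follows. This is only an illustration of the interpretation described above, not the CN2 software itself; the rule representation (a list of attribute tests paired with a class) and the thresholds in the example rules are hypothetical.

```python
def classify_with_rule_list(rules, default_class, example):
    """Try each rule in order; the first rule whose complex is satisfied decides the class."""
    for attribute_tests, predicted_class in rules:
        # A complex is a conjunction: every attribute test must hold.
        if all(test(example) for test in attribute_tests):
            return predicted_class
    # No induced rule fired: fall back to the most common class in the training data.
    return default_class


# Hypothetical rules over two of the ratios; thresholds are made up for the example.
rules = [
    ([lambda e: e["X1"] < 1.0, lambda e: e["X3"] < 0.0], "failed"),
    ([lambda e: e["X5"] > 0.5], "non-failed"),
]
prediction = classify_with_rule_list(rules, "non-failed", {"X1": 0.8, "X3": -0.1, "X5": 0.2})
```

Because the list is ordered, a rule is only reached when all earlier rules have failed, so each rule must be read in the context of the ones before it.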
The learning algorithm of CN2 works iteratively, searching in
each iteration for a complex that covers a large number of examples of a single class C and few of
other classes. Having found a good complex, the examples it covers are removed from the
training set and the rule "if complex then predict C" is added to the end of the rule list. This
process iterates until no more satisfactory complexes can be found.
In this project The CN2 Induction Algorithm developed by The Turing Institute has been
used with the option to produce ordered list of if-then rules (The CN2 Induction Algorithm).
3.1.4 CLASSIFICATION TREES
The Classification and Regression Trees (CART) software was developed by Breiman et
al. (1984). CART is used for classification or regression analysis of large, complex data sets
containing many variables, and constructs both binary prediction and classification trees
by recursive partitioning (or targeted stratification). CART can construct a binary prediction tree
with a structure which is easy to understand, interpret, and use. Tree construction starts with a
computationally intensive algorithm that recursively searches over all the variables to produce a
sequence of optimal binary splits, so that an extremely large tree is grown. A pruning algorithm is
then applied which cuts off all branches of the tree that impair the overall accuracy. The result is
a parsimonious decision tree: the simplest tree that gives the maximum accuracy. The tree can be
inspected and used to modify the staging algorithm. The procedure is repeated until the best set is
found. Cross-validation can be performed a number of times with each analysis.
Classification trees are a popular representation of
classifiers. The interior nodes of a classification tree are tests applied to instances during
classification. Branches from an interior node correspond to the possible test outcomes.
Classification begins with the application of the root node test, its outcome determining the
branch to a succeeding node. The process is applied recursively until a leaf node is reached. Then
the instance is labeled with the class of the leaf node, and the process halts. The trees are
constructed beginning with the root of the tree and proceeding down to its leaves.
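The classification procedure just described, from the root test down to a leaf, can be sketched in Python. The dictionary-based tree representation and the thresholds in the example tree are assumptions made for illustration; they do not reproduce any tree obtained in the study.

```python
def classify_with_tree(node, instance):
    """Apply node tests from the root until a leaf is reached, then return its class."""
    while "label" not in node:
        outcome = node["test"](instance)    # test applied at an interior node
        node = node["branches"][outcome]    # follow the branch for that outcome
    return node["label"]


# Hypothetical binary tree over two ratios; thresholds are made up for the example.
tree = {
    "test": lambda inst: inst["X1"] > 1.0,
    "branches": {
        True: {"label": "non-failed"},
        False: {
            "test": lambda inst: inst["X3"] > 0.0,
            "branches": {True: {"label": "non-failed"}, False: {"label": "failed"}},
        },
    },
}
prediction = classify_with_tree(tree, {"X1": 0.8, "X3": -0.1})
```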
The family's palindromic name (TDIDT) emphasizes that its members carry out the top-down
induction of decision trees. The patriarch of this family is Hunt's concept learning system
(CLS) framework (Hunt et al. (1966)). ID3 (Quinlan (1979) and Quinlan (1986)), one of the
most famous TDIDT systems, was developed from CLS and uses a selection measure based on the
computation of an information gain for each variable: the variable that maximizes this gain is
selected. A notable disadvantage of this measure is that it is biased towards selecting variables
with many different values. ACLS (Patterson et al. (1983)) and ASSISTANT (Cestnik et al.
(1987)) acknowledge ID3 (Quinlan (1979)) as their direct ancestor. ASSISTANT has been used in
several medical domains with promising results. In our example the statistical software SPAD.S
has been used, in which the approach proposed by Breiman et al. (1984) is implemented.
The procedure used to construct the classification trees was DISAR. This procedure constructs
complete binary decision trees for the discrimination of 2 groups, categorized by the modalities
of one nominal variable.
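To make the information gain measure used by ID3 concrete, the following Python sketch computes it for a categorical variable using the standard entropy-based definition. It illustrates the measure only; it is not the DISAR procedure of SPAD.S, and the function names are introduced here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the classes minus the weighted entropy after splitting on the variable."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder
```

A variable that takes a different value for every individual, such as an identifier, splits the sample into pure subsets of size one and therefore attains the maximum gain even though it has no predictive value, which is the bias towards many-valued variables mentioned above.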
3.1.5 BAYESIAN NETWORKS
While the first attempts at building Expert Systems regarded probabilities as an
underlying formalism, the large number of parameters to estimate ($2^n - 1$ for the case of n
dichotomic variables) forced researchers to reject this massive approach and instead adopt
probabilistic models based on the hypothesis of independence among variables. These models
had the advantage of simplicity, but they were incapable of giving good solutions in
problems with a high degree of interdependence among variables.
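As a worked illustration of this parameter count (the value n = 9 is chosen here only as an example): a full joint distribution over n dichotomic variables needs one probability per configuration minus one for normalization, whereas a model assuming full independence among the variables needs a single probability per variable.

$$\left. 2^{n} - 1 \right|_{n = 9} = 511 \qquad \text{versus} \qquad \left. n \right|_{n = 9} = 9$$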
As a consequence of these extreme positions, probability was not taken into account in
the building of Expert Systems until the late 80s, when Lauritzen and Spiegelhalter (1988)
developed an algorithm for the propagation of evidence. This algorithm allowed probabilistic
reasoning within graphical models which represented conditional independence among the
variables of the system.
Excellent introductions to Bayesian Network paradigm can be found in Castillo et al.
(1997) and Jensen (1996).
One possible classification of the structural learning methods can be carried out by
grouping them into two main groups.
(i) Methods based on hypothesis tests which check the assumption of conditional
independence among variables and create a Bayesian Network structure which
represents the obtained independence.
(ii) Methods based on score and search, in which a measure of the goodness of fit
(likelihood, entropy, percentage of well-classified) of a particular structure is
defined as well as, a searching procedure over the space of all possible structures of
Bayesian Networks. This searching procedure usually works like a greedy
algorithm.
In our case we used Genetic Algorithms as an intelligent metaheuristic in the search
process. Genetic Algorithms (Goldberg (1989)) use a direct analogy with natural behavior.
They work with a population of individuals, each individual representing a feasible solution to a
given problem. To each individual we assign a value or score according to the goodness of the
solution it represents.
The better the adaptation of the individual to the problem, the more probable it is that the
individual will be selected for reproduction, crossing its genetic material with another individual
selected in the same way. This crossover will produce new individuals - offspring of the previous
ones - which share some of the features of their parents. In this way a new population of feasible
solutions is produced, replacing the previous one and verifying the interesting property of having
a greater proportion of good features than the previous population. Thus, through these
generations, good features are propagated through the population. By favoring the crossing of the
fittest individuals, the most promising areas of the search space are explored.
Figure 5 summarizes the pseudo-code for the so-called Abstract Genetic Algorithm. In it, the
parent selection doesn't need to be made by assigning to each individual a value proportional to
its objective function, as is usual in the so-called Simple Genetic Algorithm. This selection can be
carried out by any function that selects parents in a natural way. It is worth noticing that
descendants are not necessarily the next generation of individuals, but that this generation is made
up of the union of parents and descendants. That is why the operations of extension and reduction
in the cycle are required.
begin AGA
Make initial population at random
WHILE NOT stop DO
BEGIN
Select parents from the population
Produce children from the selected parents
Mutate the individuals
Extend the population by adding the children to it
Reduce the extended population
END
Output the best individual found
end AGA
Figure 5. The pseudo-code of the Abstract Genetic Algorithm
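The cycle of Figure 5 can be sketched in Python as follows. This is only an illustrative sketch, not the implementation used in this work: the binary tournament selection, the truncation used in the reduction step, the choice to mutate only the children, and the toy bit-counting problem are all assumptions made for the example.

```python
import random

def abstract_ga(fitness, random_individual, crossover, mutate,
                pop_size=20, generations=50, seed=0):
    """Abstract Genetic Algorithm: select, breed, mutate, extend, reduce."""
    rng = random.Random(seed)
    # Make initial population at random
    population = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):  # stop criterion: fixed number of cycles
        # Select parents (assumption: binary tournaments)
        parents = [max(rng.sample(population, 2), key=fitness)
                   for _ in range(pop_size)]
        # Produce children from consecutive pairs of selected parents
        children = [crossover(parents[i], parents[i + 1], rng)
                    for i in range(0, pop_size - 1, 2)]
        # Mutate (assumption: only the new children are mutated)
        children = [mutate(child, rng) for child in children]
        # Extend the population with the children, then reduce it again
        extended = population + children
        population = sorted(extended, key=fitness, reverse=True)[:pop_size]
    # Output the best individual found
    return max(population, key=fitness)

# Toy problem (hypothetical): maximize the number of ones in a bit string.
N_BITS = 12

def random_bits(rng):
    return [rng.randint(0, 1) for _ in range(N_BITS)]

def one_point_crossover(a, b, rng):
    cut = rng.randrange(1, N_BITS)
    return a[:cut] + b[cut:]

def flip_mutation(individual, rng, rate=0.05):
    return [1 - bit if rng.random() < rate else bit for bit in individual]

best = abstract_ga(sum, random_bits, one_point_crossover, flip_mutation)
```

Because the reduction step keeps the fittest members of the union of parents and descendants, the best score in the population can never decrease from one cycle to the next.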
The individuals of the Genetic Algorithm will be Bayesian Network structures. A
Bayesian Network structure, for a fixed domain with n variables, can be represented by an n x n
connectivity matrix C, whose elements $c_{ij}$ can be defined as:
$$c_{ij} = \begin{cases} 1 & \text{if } j \text{ is a parent of } i, \\ 0 & \text{otherwise.} \end{cases}$$
In this genetic approach, an individual of the population is represented by the string:
$$c_{11} c_{21} \ldots c_{n1} \; c_{12} c_{22} \ldots c_{n2} \; \ldots \; c_{1n} c_{2n} \ldots c_{nn}$$
As shown in Larrañaga et al. (1996a), when a total order among the variables is assumed,
the usual genetic operators are closed operators with respect to the DAG
conditions. In the more general case in which there is no assumption of order among the
variables, Larrañaga et al. (1996b), the usual genetic operators are not closed operators and, to
ensure closure, a repair operator is introduced. The objective of this repair operator is to
transform the child structures that do not verify the DAG conditions into DAGs, by randomly
eliminating the edges that invalidate the DAG conditions.
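The string encoding and a possible repair operator can be sketched as follows. The sketch assumes the matrix convention in which C[i][j] = 1 means that variable j is a parent of variable i; the cycle detection by depth-first search and the rule "delete one randomly chosen edge of a detected cycle until no cycle remains" are assumptions made for illustration, since the text only states that invalidating edges are eliminated at random.

```python
import random

def to_string(C):
    """Flatten C column by column: c11 c21 ... cn1 c12 ... cnn."""
    n = len(C)
    return [C[i][j] for j in range(n) for i in range(n)]

def from_string(bits, n):
    """Rebuild the n x n connectivity matrix from the flattened string."""
    C = [[0] * n for _ in range(n)]
    for index, bit in enumerate(bits):
        j, i = divmod(index, n)
        C[i][j] = bit
    return C

def find_cycle(C):
    """Return the (parent, child) edges of one directed cycle, or None."""
    n = len(C)
    children = {j: [i for i in range(n) if C[i][j]] for j in range(n)}
    state = [0] * n  # 0 = unvisited, 1 = on the current path, 2 = finished
    path = []

    def dfs(u):
        state[u] = 1
        path.append(u)
        for v in children[u]:
            if state[v] == 1:  # back edge: a cycle closes at v
                cycle = path[path.index(v):] + [v]
                return list(zip(cycle, cycle[1:]))
            if state[v] == 0:
                found = dfs(v)
                if found:
                    return found
        state[u] = 2
        path.pop()
        return None

    for start in range(n):
        if state[start] == 0:
            found = dfs(start)
            if found:
                return found
    return None

def repair(C, rng):
    """Return a copy of C made acyclic by deleting random cycle edges."""
    C = [row[:] for row in C]
    while True:
        cycle = find_cycle(C)
        if cycle is None:
            return C
        parent, child = rng.choice(cycle)
        C[child][parent] = 0
```

Each pass removes exactly one edge, so the loop stops after at most as many passes as there are edges in the structure.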
This genetic approach, as can be seen in the previous reference, has yielded good results in the
reconstruction of the ALARM network. Although in the corporate failure problem the cardinality
of the search space is not too large, the same approach has been used. See Larrañaga et al.
(1996c) for another approach to the problem of structural learning of
Bayesian Networks, in which an individual is a cyclic permutation of the n variables considered, and
the Genetic Algorithm uses crossover and mutation operators developed for the Traveling
Salesman Problem.
Genetic Algorithms have also been used as an optimization tool in other combinatorial problems
that arise in the Bayesian Network context. For example, Larrañaga et al. (1997)
obtain good decompositions of the moral graph associated with the propagation algorithm
proposed by Lauritzen and Spiegelhalter. Finally, Larrañaga et al. (1996d) handle the problem of
fusing Bayesian Networks proposed by different authors, seeking the consensus network.
Taking into account that in a Bayesian Network any variable is only influenced by its
Markov Blanket, that is, its parent variables, its children variables and the parent variables of its
children variables, it seems intuitive to search in the set of structures that are Markov Blankets of
the special variable (Sierra and Larrañaga (1997)).
Now, the individuals in the Genetic Algorithm are Markov Blankets for the variable to be
classified. One operator has been introduced that guarantees that the children obtained form
a Markov Blanket of the variable to be classified. The score used to search
for the best Markov Blanket is the percentage of well-classified individuals obtained by
applying the evidence propagation feature of the HUGIN software, Andersen et al. (1989).
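As an illustration of the Markov Blanket definition given above, the following sketch reads the blanket of one variable off a connectivity matrix. It assumes the convention C[i][j] = 1 when variable j is a parent of variable i; the five-variable network in the usage example is hypothetical.

```python
def markov_blanket(C, target):
    """Parents, children and parents of the children of `target`."""
    n = len(C)
    parents = {j for j in range(n) if C[target][j]}
    children = {i for i in range(n) if C[i][target]}
    co_parents = {j for child in children for j in range(n) if C[child][j]}
    return (parents | children | co_parents) - {target}

# Hypothetical network: 4 -> 0 -> 2 -> 3 <- 1, with variable 2 as the class.
C = [[0] * 5 for _ in range(5)]
C[0][4] = 1  # 4 is a parent of 0
C[2][0] = 1  # 0 is a parent of 2
C[3][2] = 1  # 2 is a parent of 3
C[3][1] = 1  # 1 is a parent of 3
blanket = markov_blanket(C, target=2)
```

In this example variable 4 stays outside the blanket of variable 2: it only influences the class through variable 0, which is already in the blanket.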
3.2 Methodologies - Combining Classifiers
Recently, in the area of Artificial Intelligence, the concept of combining classifiers has been
proposed as a new direction for improving the performance of individual classifiers.
These classifiers could be based on a variety of classification methodologies, and could achieve
different rates of correctly classified individuals. The goal of classification result integration
algorithms is to generate more certain, precise and accurate system results. As explained in Xu et
al. (1992), there are a couple of problems to be solved in the process of combining classifiers.
First of all there is the problem of “how many classifiers to choose, and what kind of classifiers
should they be?”. The second problem is to determine the adequate technique for combining
the results of the chosen classifiers so that a better result can be obtained.
In this work five individual classifiers are chosen, coming from two different areas,
Statistics and Machine Learning. To tackle the second problem a large variety of combining
techniques was consulted, and finally two of the several approaches proposed in Xu et al. (1992)
were chosen. These techniques are the Voting by Majority Principle and the Bayesian
Formalism. The decision to use these combining approaches was influenced by their generality.
They allow the integration of classifiers based on different methodologies, as long as each one can
supply its output information at the abstract level - class j -, which is the only requirement these
techniques have.
The following are the two techniques used for the combination of individual classifiers.
The combination is based on the output information at the abstract level. The representation that
will be used to explain their characteristics is the following:
“Given K individual classifiers, $e_k$, $k = 1, \ldots, K$, each of which assigns to the input X a
label $j_k$, i.e. produces an event $e_k(X) = j_k$, the problem is to use these events to build an
integrated classifier E, which gives X one definitive label j, i.e. $E(X) = j$, and $j \in \Lambda \cup \{M+1\}$,
where $\Lambda = \{1, 2, \ldots, M\}$, representing a set of specified patterns called class and $\{M+1\}$ denotes
that e has no idea which class X comes from, or in other words, X is rejected by e.” (e.g. M=2 for
the corporate failure problem, and K=5 in this work - DA, LR, CN2, CART, and BAYES).
3.2.1 VOTING BY MAJORITY PRINCIPLE
As indicated before, the problem is to produce a new event $E(X) = j$ from the given
events $e_k(X) = j_k$, $k = 1, \ldots, K$, where the following equation may not necessarily hold:
$$e_1(X) = e_2(X) = \ldots = e_K(X).$$
That is, conflicts may exist among the decisions of the K classifiers. A simple and common
rule used for solving this kind of conflict in human social life is the Voting By Majority Principle.
Under it, if the majority of the K classifiers vote for the same label then the combining classifier
takes this label as the final result, where majority means that more than half of the classifiers $e_k$
vote for the same label.
The most general majority voting rule is the following:
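$$E(X) = \begin{cases} j & \text{if } T_E(X = j) = \max_{i \in \Lambda} T_E(X = i) > \frac{K}{2}, \\ M+1 & \text{otherwise,} \end{cases}$$
where
$$T_E(X = i) = \sum_{k=1}^{K} T_k(X = i), \quad i = 1, \ldots, M,$$
and
$$T_k(X = i) = \begin{cases} 1 & \text{if } e_k(X) = i \text{ and } i \in \Lambda, \\ 0 & \text{otherwise,} \end{cases}$$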
that is, the combined classifier E decides that X comes from class j if the majority of the
K classifiers decide that X comes from class j; otherwise it rejects X. It is easy to see that
classifiers that reject X have no impact on the combined E, unless the majority of them reject X
or the number K is even and half of them vote for one class and half for another class. In
this project there are five individual classifiers and none of them has the option to decide that the
class of individual X is unknown, so there will never be a situation where the integrated classifier E
does not know which class to assign to X or, in other words, no X will be rejected by E.
3.2.2 THE BAYESIAN FORMALISM
The previous voting method combines the results of the individual classifiers based only
on the label output by each classifier (i.e., the event $e_k(X) = j_k$). Each $e_k(X) = j_k$
is treated equally as one vote, without considering the error of each $e_k$ itself. The Bayesian
Formalism takes these errors into consideration, representing each classifier's error by a confusion
matrix that is given by:
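$$PT_k = \begin{pmatrix} n_{11}^{(k)} & n_{12}^{(k)} & \cdots & n_{1M}^{(k)} \\ n_{21}^{(k)} & n_{22}^{(k)} & \cdots & n_{2M}^{(k)} \\ \vdots & \vdots & \cdots & \vdots \\ n_{M1}^{(k)} & n_{M2}^{(k)} & \cdots & n_{MM}^{(k)} \end{pmatrix} \quad \text{for } k = 1, 2, \ldots, K \qquad (1)$$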
where each row i corresponds to class i and each column j corresponds to the event
$e_k(X) = j$. Thus, an element $n_{ij}^{(k)}$ denotes that $n_{ij}^{(k)}$ samples of class i have been
assigned a label j by $e_k$. It follows from (1) that the size of the sample is:
$$N^{(k)} = \sum_{i=1}^{M} \sum_{j=1}^{M} n_{ij}^{(k)} \qquad (2)$$
in which the number of cases in each class i is:
$$n_{i\bullet}^{(k)} = \sum_{j=1}^{M} n_{ij}^{(k)}, \quad i = 1, \ldots, M \qquad (3)$$
and the number of cases that are assigned class j by $e_k$ is:
$$n_{\bullet j}^{(k)} = \sum_{i=1}^{M} n_{ij}^{(k)}, \quad j = 1, \ldots, M. \qquad (4)$$
For an event $e_k(X) = j$ of an error-bearing classifier $e_k$, its truth (i.e., X comes from
class j) has uncertainty. With the knowledge of its confusion matrix $PT_k$, such an uncertainty
could be described by the conditional probabilities that the propositions $X \in i$, $i = 1, \ldots, M$ are true
under the occurrence of the event $e_k(X) = j$, that is:
$$P(X \in i \mid e_k(X) = j) = \frac{n_{ij}^{(k)}}{n_{\bullet j}^{(k)}} = \frac{n_{ij}^{(k)}}{\sum_{i=1}^{M} n_{ij}^{(k)}}. \qquad (5)$$
From another point of view, the confusion matrix $PT_k$ could be regarded as the prior
knowledge of an expert. The expert can express his belief for the event $e_k(X) = j$ by a real
number bel(.), called belief value. The higher the bel(.) he gives to a proposition, the more
likely it is true. With the knowledge of $PT_k$, bel(.) for the proposition $X \in i$ is given in the form
of a conditional probability, given by (5), represented with the following equation:
$$\mathrm{bel}(X \in i \mid e_k(X) = j) = P(X \in i \mid e_k(X) = j) = \frac{n_{ij}^{(k)}}{n_{\bullet j}^{(k)}} = \frac{n_{ij}^{(k)}}{\sum_{i=1}^{M} n_{ij}^{(k)}}. \qquad (6)$$
With K classifiers $e_1, e_2, \ldots, e_K$, there will be K matrices $PT_1, \ldots, PT_K$. When
these classifiers are used on the same input X, K events $e_k(X) = j_k$, $k = 1, \ldots, K$ will happen.
Each $e_k(X) = j_k$ and its corresponding $PT_k$ could supply a set of:
$$\mathrm{bel}(X \in i \mid e_k(X) = j_k), \quad i = 1, \ldots, M.$$
Now, the question is how to integrate these individual supports to give the combined
value bel(i). Xu et al. (1992) deduce that bel(i) could be calculated as follows:
$$\mathrm{bel}(i) = \eta \prod_{k=1}^{K} P(X \in i \mid e_k(X) = j_k) \qquad (7)$$
with $\eta$ a constant that ensures that $\sum_{i=1}^{M} \mathrm{bel}(i) = 1$ (since the propositions $X \in i$,
$i = 1, \ldots, M$ are mutually exclusive and exhaustive). That is:
$$\frac{1}{\eta} = \sum_{i=1}^{M} \prod_{k=1}^{K} P(X \in i \mid e_k(X) = j_k). \qquad (8)$$
Finally, depending on these bel(i) values, the integrated classifier E can classify X into a
class according to the decision rule given by:
$$E(X) = \begin{cases} j & \text{if } \mathrm{bel}(j) = \max_{i \in \Lambda} \mathrm{bel}(i), \\ M+1 & \text{otherwise,} \end{cases}$$
that is, E decides that X comes from class j if the belief value bel(j) is greater
than any other belief bel(i), i = 1, …, M; otherwise it rejects X or, in other words, E has no idea
about which class X comes from.
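Equations (5), (7) and (8) and the decision rule can be traced with a short Python sketch. It is an illustration only: the two confusion matrices are invented numbers, not results from this study, and the sketch assumes that every column total and the normalizing sum in (8) are nonzero.

```python
def beliefs(confusions, labels):
    """Combined belief bel(i) for each class i = 1..M.

    confusions[k][i][j]: samples of class i+1 labelled j+1 by classifier k.
    labels[k]: label j_k (1-based) that classifier k assigns to the input X.
    """
    num_classes = len(confusions[0])
    products = []
    for i in range(num_classes):
        product = 1.0
        for matrix, j in zip(confusions, labels):
            column_total = sum(matrix[r][j - 1] for r in range(num_classes))
            product *= matrix[i][j - 1] / column_total  # equation (5)
        products.append(product)                        # product in (7)
    normalizer = sum(products)                          # 1 / eta, equation (8)
    return [p / normalizer for p in products]

def bayes_combine(confusions, labels):
    """Return the class with the strictly largest belief, else M+1 (reject)."""
    bel = beliefs(confusions, labels)
    best = max(bel)
    winners = [i + 1 for i, value in enumerate(bel) if value == best]
    return winners[0] if len(winners) == 1 else len(bel) + 1

# Hypothetical confusion matrices for two classifiers and two classes.
PT1 = [[40, 10], [20, 30]]
PT2 = [[35, 15], [5, 45]]
bel = beliefs([PT1, PT2], labels=[1, 2])       # approximately [0.4, 0.6]
decision = bayes_combine([PT1, PT2], [1, 2])   # class 2
```

In this invented example the first classifier votes for class 1 and the second for class 2, a tie under majority voting; the second classifier's label 2 is correct for 45 of its 60 uses, so the combined belief favours class 2.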
4. Results
In this section the results of applying the methods described in the previous section to the
problem of corporate failure are explained in full detail. The models for each one of the five
paradigms used in this project, and for each period of time before failure (1 year, 2 years and 3
years), are shown in this section. The results are reported as the percentage of well-classified
companies for each paradigm and period.
Discriminant Analysis (DA)
Experiments with this paradigm were carried out using the SPSS software. The
following are the classifiers, obtained with a sample of 120 companies, for each one of the 3
consecutive years before the failure, using DA.
Three years before the failure
S(F) = -2.0081765 + X4*2.2476778 + X5*2.529652
Scores for the centers of groups are: - class 1 = -0.33467
- class 2 = 0.33467
X4= Earnings before interests and taxes / financial charges
X5= Own funds / Total debt
S(F): Score of Failure
where:
“if S(F) > 0 then the company is classified as failed,
otherwise as non-failed”
Two years before the failure
S(F) = -1.4674983 + X2*2.3656861 + X3*-8.1494602 + X5*3.1317845
Scores for the centers of groups are: - class 1 = -0.63454
- class 2 = 0.63454
X2= Current assets / Total assets
X3= Net result / Total assets
X5= Own funds / Total debt
X7= Stocks / Sales
S(F): Score of Failure
where:
“if S(F) > 0 then the company is classified as failed,
otherwise as non-failed”
One year before the failure
S(F) = -0.5673571 + X3*9.364902 + X5*0.5510232
Scores for the centers of groups are: - class 1 = -0.75698
- class 2 = 0.75698
X3= Net result / Total assets
X5= Own funds / Total debt
S(F): Score of Failure
where:
“if S(F) > 0 then the company is classified as failed,
otherwise as non-failed”
Logistic Regression (LR)
Experiments with this paradigm were carried out using the SAS software. The following
are the classifiers, obtained with a sample of 120 companies, for each one of the 3 consecutive
years before the failure, using LR. The probability function
$$P(F) = \frac{e^{g(X)}}{1 + e^{g(X)}}$$
is the same for each one of the three models, where P(F) is the failure probability and
“if P(F) > 0.5 then the company is classified as failed,
otherwise as non-failed”
Three years before the failure
g(X) = -1.3287 + X4*1.4877 + X5*1.6896
X4= Earnings before interests and taxes / financial charges
X5= Own funds / Total debt
Two years before the failure
g(X) = -0.4999 + X3*-13.5892 + X5*3.8058
X3= Net Result / Total Assets
X5= Own funds / Total debt
One year before the failure
g(X) = -1.2854 + X3*26.1304 + X5*1.3535
X3= Net Result / Total Assets
X5= Own funds / Total debt
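As a hedged illustration of how these three logistic models would be applied, the sketch below evaluates g(X) and P(F) for a company. The coefficients are the ones listed above; the dictionary layout, the function names and the ratio values in the usage example are assumptions made for the example. The probability is computed as 1 / (1 + e^(-g(X))), which is algebraically the same function as the one given above.

```python
import math

# Coefficients of g(X) as reported above, keyed by years before failure.
LR_MODELS = {
    3: {"intercept": -1.3287, "X4": 1.4877, "X5": 1.6896},
    2: {"intercept": -0.4999, "X3": -13.5892, "X5": 3.8058},
    1: {"intercept": -1.2854, "X3": 26.1304, "X5": 1.3535},
}

def failure_probability(years_before, ratios):
    """P(F) = e^g / (1 + e^g), written in the equivalent form 1 / (1 + e^-g)."""
    model = LR_MODELS[years_before]
    g = model["intercept"] + sum(
        weight * ratios[name]
        for name, weight in model.items() if name != "intercept")
    return 1.0 / (1.0 + math.exp(-g))

def classify(years_before, ratios):
    """Apply the 0.5 threshold stated for the LR models."""
    return "failed" if failure_probability(years_before, ratios) > 0.5 else "non-failed"

# Hypothetical company: both ratios equal to 1.0, three years before failure.
p = failure_probability(3, {"X4": 1.0, "X5": 1.0})
label = classify(3, {"X4": 1.0, "X5": 1.0})
```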
Rule Induction (CN2)
Experiments with this paradigm were carried out using the CN2 algorithm. The
following are the classifiers obtained with a sample of 120 companies as a training set, for each one
of the 3 consecutive years before the failure, using the CN2 algorithm. The models are represented as
ordered lists of if-then rules, as can be seen in Figures 10, 11 and 12.
Three years before the failure
if ((x4 < 0.70) && (x8 < 0.11)) then X=NON-FAILED
else if (( x2 > 0.52 ) && (x3 > 0.01) && ( x5 > 0.68 ) && (x9 > 0.04)) then X=FAILED
else if (( x2 > 0.32 ) && ( x4 > 0.76 ) && ( x9 < 0.15 )) then X=FAILED
else if (( x1 > 0.06 ) && ( x2 < 0.51 ) && ( x8 > 0.17 )) then X=NON-FAILED
else if (( x1 > 0.16 ) && ( x5 < 0.56 ) && ( x7 < 0.09 ) && (x8 > 0.19)) then X=FAILED
else if (( x2 > 0.88 ) && ( x6 > 0.22 )) then X=NON-FAILED
else if (( x2 < 0.87 ) && ( x6 < 0.19 ) && ( x7 > 0.14 )) then X=NON-FAILED
else if (( x4 < 0.39 ) && ( x6 < 0.26 )) then X=FAILED
else if (( x1 < 0.98 ) && ( x6 < 0.41 )) then X=NON-FAILED
else if (( x2 > 0.74 ) && ( x3 > 0.02)) then X=FAILED
else if ((x4 < 0.61) && (x9 > 0.19)) then X=NON-FAILED
else if ((x3 < 0.25) && ( x9 > 0.08 )) then X=FAILED
else if (( x1 < 0.90 ) && ( x4 > 0.08) && ( x7 > 0.05)) then X=NON-FAILED
else X=FAILED
Figure 10. Rule Induction model for three years before the failure
Two years before the failure
if ((x4 > 0.17) && (x5 > 0.50) && (x8 > 0.19)) then X=FAILED
else if (( x3 > 0.09 ) && ( x8 > 0.20 )) then X=NON-FAILED
else if (( x1 > 0.06 ) && ( x7 > 0.30 ) && ( x7 < 0.44 )) then X=NON-FAILED
else if (( x1 > 0.02 ) && ( x6 > 0.75 ) && ( x9 < 0.19 )) then X=FAILED
else if (( x1 > 0.25 ) && ( x4 > 0.06 ) && ( x9 < 0.05 )) then X=NON-FAILED
else if (( x1 > 0.11 ) && ( x2 > 0.74 ) && ( x9 < 0.24 )) then X=FAILED
else if (( x4 > 0.08 ) && ( x7 > 0.22 ) && ( x7 < 0.34 )) then X=NON-FAILED
else if (( x1 < 0.72 ) && ( x2 > 0.48 ) && ( x2 < 0.71 ) && ( x3 < 0.10 )) then X=FAILED
else if (( x3 < 0.10 ) && ( x6 > 0.10 )) then X=NON-FAILED
else if ( x2 > 0.69 ) then X=FAILED
else X=NON-FAILED
Figure 11. Rule Induction model for two years before the failure
One year before the failure
if (x4 < 0.80) then X=NON-FAILED
else if (( x1 > 1.41 ) && ( x4 > 0.87 ) && ( x7 < 0.36)) then X=FAILED
else if (( x4 > 0.95 ) && ( x5 < 0.22 )) then X=NON-FAILED
else if (( x5 < 0.36 ) && ( x8 > 0.03 )) then X=FAILED
else if (( x4 < 1.90 ) && ( x6 > 4.69 )) then X=NON-FAILED
else if (( x1 > 0.81 ) && ( x1 < 1.38 ) && ( x7 < 0.23 )) then X=FAILED
else if (( x8 > 0.17 ) && ( x7 < 0.67 )) then X=NON-FAILED
else X=FAILED
Figure 12. Rule Induction model for one year before the failure
The rule lists output a class label only, so the probability attached to the assigned class is
always 100%, whether the company is classified as failed or as non-failed.
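An ordered rule list is evaluated from top to bottom, and the first rule whose condition holds decides the class. The sketch below encodes the one-year model of Figure 12 in that way; the representation as a list of (condition, label) pairs and the ratio values in the usage example are assumptions made for illustration.

```python
# One-year CN2 model of Figure 12 as an ordered list of (condition, label).
RULES_ONE_YEAR = [
    (lambda r: r["x4"] < 0.80, "NON-FAILED"),
    (lambda r: r["x1"] > 1.41 and r["x4"] > 0.87 and r["x7"] < 0.36, "FAILED"),
    (lambda r: r["x4"] > 0.95 and r["x5"] < 0.22, "NON-FAILED"),
    (lambda r: r["x5"] < 0.36 and r["x8"] > 0.03, "FAILED"),
    (lambda r: r["x4"] < 1.90 and r["x6"] > 4.69, "NON-FAILED"),
    (lambda r: 0.81 < r["x1"] < 1.38 and r["x7"] < 0.23, "FAILED"),
    (lambda r: r["x8"] > 0.17 and r["x7"] < 0.67, "NON-FAILED"),
]

def classify(ratios, rules=RULES_ONE_YEAR, default="FAILED"):
    """Return the label of the first rule that fires, else the default class."""
    for condition, label in rules:
        if condition(ratios):
            return label
    return default

# Hypothetical company: every ratio set to 0.0 except x4.
company = {f"x{i}": 0.0 for i in range(1, 10)}
company["x4"] = 0.5
label = classify(company)  # the first rule (x4 < 0.80) fires
```

Because the order matters, a later rule is only reached when every earlier condition has failed, which is why the same variable can appear with different thresholds further down the list.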
For the Classification Trees and Bayesian Networks paradigms, the variables have been
categorized into three categories. The following explains how this process was
carried out for each one of the 3 years. For each group of 60 healthy companies and 60 failed
companies, and for each one of the nine ratio-variables, means were calculated in order to replace
the missing values that the original sample contained. Later, for each ratio-variable, using the
FREQUENCIES and DESCRIPTIVES procedures from the SPSS software, their distributions
were analyzed. Finally, the values of each ratio-variable were separated into three categories -
1, 2, and 3 -, according to their cumulative percentages. The next figure shows the process of
categorization.
*******************"3 years before the failure"************************
x1 (lowest through 0.30=1) (0.31 through 0.66=2) (0.67 through highest=3)
x2 (lowest through 0.63=1) (0.64 through 0.77=2) (0.78 through highest=3)
x3 (lowest through 0.02=1) (0.03 through 0.06=2) (0.07 through highest=3)
x4 (lowest through 0.21=1) (0.22 through 0.61=2) (0.62 through highest=3)
x5 (lowest through 0.24=1) (0.25 through 0.51=2) (0.52 through highest=3)
x6 (lowest through 0.32=1) (0.33 through 0.68=2) (0.69 through highest=3)
x7 (lowest through 0.11=1) (0.12 through 0.22=2) (0.23 through highest=3)
x8 (lowest through 0.21=1) (0.22 through 0.32=2) (0.33 through highest=3)
x9 (lowest through 0.07=1) (0.08 through 0.16=2) (0.17 through highest=3)
******************"2 years before the failure"************************
x1 (lowest through 0.26=1) (0.27 through 0.62=2) (0.63 through highest=3)
x2 (lowest through 0.61=1) (0.62 through 0.74=2) (0.75 through highest=3)
x3 (lowest through 0.02=1) (0.03 through 0.06=2) (0.07 through highest=3)
x4 (lowest through 0.24=1) (0.25 through 0.49=2) (0.50 through highest=3)
x5 (lowest through 0.20=1) (0.21 through 0.43=2) (0.44 through highest=3)
x6 (lowest through 0.31=1) (0.32 through 0.59=2) (0.60 through highest=3)
x7 (lowest through 0.13=1) (0.14 through 0.24=2) (0.25 through highest=3)
x8 (lowest through 0.24=1) (0.25 through 0.36=2) (0.37 through highest=3)
x9 (lowest through 0.05=1) (0.06 through 0.12=2) (0.13 through highest=3)
*******************"1 year before the failure"***********************
x1 (lowest through 1.05=1) (1.06 through 1.46=2) (1.47 through highest=3)
x2 (lowest through 0.63=1) (0.64 through 0.77=2) (0.78 through highest=3)
x3 (lowest through -0.01=1) (0.0 through 0.04=2) (0.05 through highest=3)
x4 (lowest through 0.89=1) (0.90 through 1.89=2) (1.90 through highest=3)
x5 (lowest through 0.32=1) (0.33 through 0.85=2) (0.86 through highest=3)
x6 (lowest through 2.89=1) (2.90 through 5.89=2) (5.90 through highest=3)
x7 (lowest through 0.11=1) (0.12 through 0.22=2) (0.23 through highest=3)
x8 (lowest through 0.21=1) (0.22 through 0.32=2) (0.33 through highest=3)
x9 (lowest through -0.05=1) (-0.04 through 0.05=2) (0.06 through highest=3)
Figure 6. The process of categorization of ratio variables
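The recoding of Figure 6 amounts to comparing each ratio with two cut points. The sketch below applies the cut points listed for three years before the failure. It assumes that the ratios are rounded to two decimals, so that a value falling between two listed bounds (for example between 0.30 and 0.31) cannot occur; the function names and the example company are hypothetical.

```python
# Upper bounds of categories 1 and 2, three years before the failure (Figure 6).
CUTS_3_YEARS = {
    "x1": (0.30, 0.66), "x2": (0.63, 0.77), "x3": (0.02, 0.06),
    "x4": (0.21, 0.61), "x5": (0.24, 0.51), "x6": (0.32, 0.68),
    "x7": (0.11, 0.22), "x8": (0.21, 0.32), "x9": (0.07, 0.16),
}

def categorize(value, cuts):
    """Map a ratio value to category 1, 2 or 3 given its two cut points."""
    low, high = cuts
    if value <= low:
        return 1
    if value <= high:
        return 2
    return 3

def categorize_company(ratios, cut_table=CUTS_3_YEARS):
    """Categorize every ratio of a company that appears in `ratios`."""
    return {name: categorize(value, cut_table[name])
            for name, value in ratios.items()}

# Hypothetical ratio values for one company.
categories = categorize_company({"x1": 0.50, "x7": 0.05, "x9": 0.20})
```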
Classification Trees (CART)
To carry out experiments with this paradigm the SPAD.N and the SPAD.S software
have been used. The following are the classifiers obtained with a sample of 120 companies as a
training set, for each one of the 3 consecutive years before the failure, using the procedure
DISAR. The models that represent these classifiers are the binary classification trees shown in
Figures 7, 8 and 9.
Three years before the failure
Node 1
Size = 120
H=60 and F=60
X7 /= 1 X 7= 1
Node 2 Node 3
Size = 84 Size = 36
H=48 and F=36 H=12 and F=24
X5 = 3 X5 /= 3 X4 = 3 X4 /= 3
Node 4 Node 5 Node 6 Node 7
Size = 26 Size = 58 Size = 13 Size = 23
H=11 and F=15 H=37 and F=21 H=2 and F=11 H=10 and F=13
FINAL NODE
X1 /= 1 X1 = 1 X1 /= 2 X1 = 2 X2 = 3 X2 /= 3
Node 8 Node 9 Node 10 Node 11 Node 14 Node 15
Size = 16 Size = 10 Size = 43 Size = 15 Size = 4 Size = 19
H=4 and F=12 H=7 and F=3 H=4 and F=12 H=12 and F=3 H=0 and F=4 H=10 and F=9
FINAL NODE FINAL NODE
X6 = 3 X6/= 3 X2 /= 1 X2 = 1 X5 = 2 X5/= 2
Node 16 Node 17 Node 18 Node 19 Node 20 Node 21
Size = 7 Size = 9 Size = 6 Size = 4 Size = 21 Size = 22
H=0 and F=7 H=4 and F=5 H=3 and F=3 H=4 and F=0 H=9 and F=12 H=16 and F=6
FINAL NODE FINAL NODE
X8 /= 2 X8 = 2 X8 /= 1 X8 = 1 X7 /=3 X7 = 3
Node 34 Node 35 Node 36 Node 37 Node 40 Node 41
Size = 7 Size = 2 Size = 4 Size = 2 Size = 9 Size = 12
H=2 and F=5 H=2 and F=0 H=3 and F=1 H=0 and F=2 H=1 and F=8 H=8 and F=4
FINAL NODE FINAL NODE FINAL NODE
X3 /= 2 X3 = 2 X3 = 3 X3 /= 3 X3 = 3 X3 /= 3
Node 68 Node 69 Node 80 Node 81 Node 82 Node 83
Size = 4 Size = 3 Size = 1 Size = 8 Size = 2 Size = 10
H=0 and F=4 H=2 and F=1 H=1 and F=0 H=0 and F=8 H=0 and F=2 H=8 and F=2
FINAL NODE FINAL NODE FINAL NODE FINAL NODE FINAL NODE FINAL NODE
Node 21
Size = 22
H=16 and F=6
X3 = 3 X 3 /= 3
Node 42 Node 43
Size = 7 Size = 15
H=7 and F= 0 H=9 and F= 6
FINAL NODE
X2 = 3 X2 /= 3
Node 86 Node 87
Size = 8 Size = 7
H=6 and F=2 H=3 and F=4
X6 /= 2 X6 = 2 X1 = 3 X1 /= 3
Node 172 Node 173 Node 174 Node 175
Size = 7 Size = 1 Size = 4 Size = 3
H=6 and F=1 H=0 and F=1 H=3 and F=1 H=0 and F=3
FINAL NODE FINAL NODE FINAL NODE FINAL NODE
Node 15 Node 121
Size = 19 Size = 8
H=10 and F=9 H=5 and F=3
X9 /= 2 X9 = 2 X3 = 3 X3 /= 3
Node 30 Node 31 Node 242 Node 243
Size = 16 Size = 3 Size = 6 Size = 2
H=10 and F=6 H=0 and F=3 H=10 and F=6 H=2 and F=0
FINAL NODE FINAL NODE
X3 /= 1 X3 = 1 X9 = 3 X9 /= 3
Node 60 Node 61 Node 484 Node 485
Size = 13 Size = 3 Size = 4 Size = 2
H=7 and F=6 H=3 and F=0 H=1 and F=3 H=2 and F=0
FINAL NODE FINAL NODE FINAL NODE
X3 /= 1 X3 = 1
Node 120 Node 121
Size = 5 Size = 8
H=2 and F=3 H=5 and F=3
FINAL NODE
Figure 7. Classification Tree model for three years before the failure
Two years before the failure
Node 1
Size = 120
H=60 and F=60
X5 = 3 X 5 /= 3
Node 2 Node 3
Size = 41 Size = 79
H=11 and F=30 H=49 and F=30
X7 /= 1 X7 = 1 X3 = 3 X3 /= 3
Node 4 Node 5 Node 6 Node 7
Size = 25 Size = 16 Size = 27 Size = 52
H=11 and F=14 H=0 and F=16 H=23 and F=4 H=26 and F=26
FINAL NODE FINAL NODE
X4 /= 1 X4 = 1 X9 /= 1 X9 = 1
Node 8 Node 9 Node 14 Node 15
Size = 16 Size = 9 Size = 37 Size = 15
H=4 and F=12 H=7 and F=2 H=15 and F=22 H=11 and F=4
FINAL NODE
X6 = 3 X6/= 3 X2 = 3 X2 /= 3 X4 /= 2 X4 = 2
Node 16 Node 17 Node 18 Node 19 Node 28 Node 29
Size = 10 Size = 6 Size = 3 Size = 6 Size = 27 Size = 10
H=4 and F=6 H=0 and F=6 H=1 and F=2 H=6 and F=0 H=14 and F=13 H=1 and F=9
FINAL NODE FINAL NODE FINAL NODE FINAL NODE
X1 /= 1 X1 = 1 X5 = 2 X5 /=2
Node 32 Node 33 Node 56 Node 57
Size = 8 Size = 2 Size = 14 Size = 13
H=2 and F=6 H=2 and F=0 H=5 and F= 9 H=9 and F=4
FINAL NODE FINAL NODE FINAL NODE
X9 = 3 X9 /=3
Node 112 Node 113
Size = 8 Size = 6
H=5 and F=3 H=0 and F=6
FINAL NODE
X3 = 2 X3 /=2
Node 224 Node 225
Size = 4 Size = 4
H=1 and F=3 H=4 and F=0
FINAL NODE FINAL NODE
Figure 8. Classification Tree model for two years before the failure
One year before the failure
Node 1
Size = 120
H=60 and F=60
X4 = 3 X 4 /= 3
Node 2 Node 3
Size = 40 Size = 80
H=3 and F=37 H=57 and F=23
X7 = 3 X7/= 3 X3 /= 1 X3 = 1
Node 4 Node 5 Node 6 Node 7
Size = 8 Size = 32 Size = 40 Size = 40
H=2 and F=6 H=1 and F=31 H=22 and F=18 H=35 and F=5
FINAL NODE FINAL NODE
X4 /= 1 X4 = 1 X2 /= 2 X2 = 2
Node 8 Node 9 Node 12 Node 13
Size = 7 Size = 1 Size = 27 Size = 13
H=1 and F=6 H=1 and F=0 H=19 and F=8 H=3 and F=10
FINAL NODE FINAL NODE
X1 /= 1 X1 = 1 X9 /= 1 X9 = 1
Node 24 Node 25 Node 26 Node 27
Size = 16 Size = 10 Size = 8 Size = 5
H=8 and F=8 H=11 and F=0 H=0 and F=8 H=3 and F=2
FINAL NODE FINAL NODE FINAL NODE
X8 /= 1 X8 =1
Node 48 Node 49
Size = 10 Size = 6
H=8 and F= 2 H=0 and F=6
FINAL NODE
X9 = 3 X9 /=3
Node 96 Node 97
Size = 2 Size = 8
H=0 and F=2 H=8 and F=0
FINAL NODE FINAL NODE
Figure 9. Classification Tree model for one year before the failure
Where: F : represents class-failed (Failed)
H : represents class-non-failed (Healthy)
FINAL NODE: is a leaf node where the process ends, and where letter F or H
represents that for individual that comes at this node the class
assigned is failed or non-failed respectively.
The failure probability is calculated with $\frac{H}{H+F}$ in the case when the company is
classified as non-failed, reaching the leaf node, or with $\frac{F}{H+F}$ when classified as failed.
Bayesian Networks
Experiments with Bayesian Networks were carried out using the HUGIN software,
Genetic Algorithms, and Markov Blankets. The models obtained with a sample of 120 companies
as a training set, for each one of the 3 years prior to failure, are shown in Figures 13, 14 and 15.
Initially, P(X) = 0.50, which means that the company is as likely to fail as not.
After giving real values to the nine variables, its destiny is predicted in
the following way:
- for three years before the failure:
“if P(X) > 0.30 then the company is classified as non-failed,
otherwise as failed”
- for two years before the failure:
“if P(X) > 0.40 then the company is classified as non-failed,
otherwise as failed”
- for one year before the failure:
“if P(X) > 0.45 then the company is classified as non-failed,
otherwise as failed”
Three years before the failure
X9 = OCF º / TOTAL ASSETS X8 = DEBTORS / SALES
X9 = 1 29 % X8 = 1 33 %
X9 = 2 32 % X8 = 2 31 %
X9 = 3 39 % X8 = 3 36 %
X = NON-FAILED / FAILED X = 1 - P(X) = 50 %
X = 2 - P(X) = 50 %
X1 = 1 - 33 %
X1 = 2 - 34 %
X1 = 3 - 33 %
X1 = CURRENT ASSETS / CURRENT LIABILITIES
X6 = 1 - 33 % X4 = 1 - 33 %
X6 = 2 - 34 % X4 = 2 - 32 %
X6 = 3 - 32 % X4 = 3 - 35 %
X6 = SALES / OWN FUNDS X4 = EBIT / FINANC. CHARGES
X5 = 2 - 33 %
X5 = 3 - 33 %
Figure 15. Bayesian Networks model for one year before the failure
The results obtained with the aforementioned models are summarized in the next two tables
as the percentage of well-classified companies for each paradigm and period of time (1 year, 2
years, and 3 years before the failure).
Table 1 summarizes these results regarding the goodness of fit of each one of the paradigms,
calculated by the 5-fold cross-validation method explained in Section 2.
                                PARADIGMS
Years before                                      BAYESIAN
the failure     DA       LR       CART     CN2      NETWORKS
1               78.33    82.50    79.17    80.0     60.83
2               69.17    69.16    60.00    66.66    62.00
3               55.00    55.00    45.00    57.50    60.83
Table 1. Results regarding the 5-fold cross-validation
As shown in Table 1, the further back in time one goes, the worse the results are, although
for Bayesian Networks they remain more or less constant over time. These results can be compared
with the ones shown in Table 2, which gives the goodness of fit of each paradigm calculated
with the second validation method proposed in Section 2.
                                PARADIGMS
Years before                                      BAYESIAN
the failure     DA       LR       CART     CN2      NETWORKS
1               81.00    82.5     89.00    95.17    91.17
2               72.17    71.51    84.50    91.17    92.00
3               59.84    60.33    76.67    88.00    92.17
Table 2. Results regarding the second validation
In this case it is worth mentioning the good behavior of paradigms such as CN2 and
Bayesian Networks, as well as the Classification Trees.
The goal of any algorithm for the integration of classification results is to generate more
certain, precise and accurate results. Two experiments were conducted in order to compare the
performance of the aforementioned methods and combining techniques. The following two tables
summarize the results obtained by applying the combining methods - Voting By Majority Principle and
Bayesian Formalism - to the classification results when 5-fold cross-validation is used (explained in
Section 2).
                                PARADIGMS
Years before                                      BAYESIAN    VOTING BY
the failure     DA       LR       CART     CN2      NETWORKS    MAJORITY
1               78.33    82.50    79.17    80.0     60.83       88.33
2               69.17    69.16    60.00    66.66    62.00       79.17
3               55.00    55.00    45.00    57.50    60.83       73.33
Table 3. Results regarding the 5-fold cross-validation and
Voting By Majority Principle
As expected, Voting By Majority Principle has succeeded in improving the classification
results, being better than the best individual classifier for each of the 3 years. These
results can be compared with the ones obtained by applying the Bayesian Formalism combining method,
which are shown in Table 4.
                                PARADIGMS
Years before                                      BAYESIAN    BAYESIAN
the failure     DA       LR       CART     CN2      NETWORKS    FORMALISM
1               78.33    82.50    79.17    80.0     60.83       73.33
2               69.17    69.16    60.00    66.66    62.00       75.83
3               55.00    55.00    45.00    57.50    60.83       62.50
Table 4. Results regarding the 5-fold cross-validation and
Bayesian Formalism
As can be noticed, Bayesian Formalism presents better results than the best individual
classifier only for 3 and 2 years before the failure, but in the case of 1 year it is, surprisingly,
only better than the worst classifier.
It is interesting to mention that when the second validation method is used to produce
the classification results for the individual classifiers, these two combining methods reach 100% in
classification for each one of the 3 years. It is too good to be true, but this situation could
be understood as follows: the more classifiers we have (in this case there are 25 models, 5 for each
paradigm), the greater the possibility that the majority of them vote for the real class and that
Bayesian Formalism reaches 100% too.
In the following Section, a comparison of results is presented, trying to reach some
conclusions about the performance of individual classifiers and their integration.
5. Conclusions and Future Work
Techniques coming from Statistics and Artificial Intelligence have been applied to the
problem of Corporate Failure prediction for different periods of time.
The discrepancy between the results obtained when the training set consists of 96 of the 120
companies and the test set of all 120 of them, and the ones obtained with 5-fold
cross-validation, suggests the existence of a problem of overfitting in the Bayesian Networks (Markov
Blankets) and CN2 paradigms. It is possible that these problems could be relieved by penalizing
complex structures in the learning process of these paradigms - see Akaike’s criteria, Minimum
Description Length, … - and by trying to guide the search by the parsimony principle.
It is interesting to see that Discriminant Analysis and Logistic Regression models select
the same ratio-variables as significant ones, for all of the three years, and that the percentages of
well-classified individuals are rather similar, being Logistic Regression classifier a little bit
better.
In the Classification Trees it is easy to notice that the closer the date of failure, the
smaller the depth of the tree. This can be understood as follows: the closer the date of failure,
the fewer variables need to be examined, and the sooner the conclusions about the “destiny” of the
individuals in the testing set are reached.
This phenomenon is also present in the ordered if-then-rule lists. Here, the closer the date
of failure, the fewer rules need to be checked to determine the “destiny” of the company. There
are thirteen rules in the list for three years before the failure, ten for two years and just
seven for one year.
The Bayesian Networks paradigm provides another aspect of interest. Observing the three
networks, it can be seen that the closer the date of failure, the more connected the nodes are
(there are more dependencies between the variables) and the greater the thresholds are.
Analyzing the classification results obtained with the combining techniques, it can be seen
that they achieve their main goal, which is to improve the performance of the individual models.
Voting by Majority Principle is considerably better than the Bayesian Formalism, which for one
year before the failure only improves on the performance of the worst individual model. Another
aspect which deserves further investigation is the development or implementation of procedures
that would achieve even better results than the two used in this project. There are several
approaches. One of them is to also use other kinds of information output by the individual
classifiers, such as the probability that the class assigned to individual X is the right one.
Another is to weight the integrated classifier by the goodness of each of the individual
classifiers.
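The two directions suggested above can be combined in a single scheme, sketched below as a weighted vote over class probabilities. This is a minimal illustration and not a procedure used in this project. The function name, the example probabilities and the weights (which stand in for some measure of each classifier's goodness, such as its estimated accuracy) are all assumptions made for the example.

```python
from typing import Sequence


def weighted_soft_vote(
    class_probabilities: Sequence[Sequence[float]],
    classifier_weights: Sequence[float],
) -> int:
    """Return the index of the class with the largest weighted mean probability.

    class_probabilities[i][c] is the probability classifier i assigns to class c;
    classifier_weights[i] is a non-negative goodness weight for classifier i.
    """
    if len(class_probabilities) != len(classifier_weights):
        raise ValueError("one weight per classifier is required")
    n_classes = len(class_probabilities[0])
    total_weight = sum(classifier_weights)
    combined = [
        sum(w * probs[c] for probs, w in zip(class_probabilities, classifier_weights))
        / total_weight
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=combined.__getitem__)


if __name__ == "__main__":
    # Hypothetical outputs for one company; class 0 = healthy, class 1 = failed.
    probabilities = [[0.60, 0.40], [0.55, 0.45], [0.10, 0.90]]
    weights = [0.7, 0.7, 0.9]  # assumed goodness of each classifier
    print(weighted_soft_vote(probabilities, weights))
```

In this example a plain majority vote would choose class 0, because two of the three classifiers lean slightly towards it. The weighted scheme chooses class 1, because the third classifier is both more confident and given a higher weight.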
ACKNOWLEDGEMENTS
The authors wish to thank Dr. Fermin Lizarraga for providing the original data sample
and his doctoral dissertation, which guided our understanding of the corporate failure
problem and of the use of financial ratios for failure prediction. The authors thank Miren
Josune Gallego for her helpful guidance in the use of the software packages, especially the SAS
software. The authors would also like to thank the Diputación Foral de Guipuzcoa, which
supported this study under grant no. OF 131/1997.
REFERENCES
• Altman, E.I. (1968). Financial Ratios, Discriminant Analysis and the Prediction of Corporate
Bankruptcy. Journal of Finance, 589-609.
• Andersen, S.K., Olesen, K.G., Jensen, F.V. and Jensen, F. (1989). HUGIN - a shell for
building Bayesian belief universes for Expert Systems. Eleventh International Joint
Conference on Artificial Intelligence, vol. I, 1128-1133.
• Argenti, J. (1976). Corporate Collapse: the Causes and Symptoms. McGraw-Hill. London.
• Beaver, W. (1966). Financial Ratios as Predictors of Failure. Empirical Research in
Accounting: Selected Studies. Supplement of Journal of Accounting Research, 71-111.
• Blum, M. (1974). Failing Company Discriminant Analysis. Journal of Accounting Research,
1-23.
• Breiman, L., Friedman, J.H., Olshen, R.A. and Stone, C.J. (1984). Classification and
Regression Trees. Monterey, CA: Wadsworth and Brooks.
• Casey, C.J. and Bartczak, N.J. (1985). Cash-Flow: it’s not the bottom line. Harvard Business
Review, 61-66.
• Castillo, E., Gutierrez, J.M. and Hadi, A.S. (1997). Expert Systems and Probabilistic
Network Models. Springer-Verlag.
• Cestnik, B., Kononenko, I. and Bratko, I. (1987). ASSISTANT 86: A knowledge-elicitation
tool for sophisticated users, in Bratko, I. and Lavrac, N. (Eds.) Progress in Machine
Learning, Sigma Press, Wilmslow.
• Clark, P. and Niblett, T. (1989). The CN2 Induction Algorithm, Machine Learning, 3(4),
261-283.
• Deakin, E.B. (1972). A Discriminant Analysis of Predictors of Business Failure. Journal of
Accounting Research, 167-179.
• Dizdarevic, S., Lizarraga F., Larrañaga P., Sierra B. and Gallego M.J. (1997). Statistical and
Machine Learning Methods in the prediction of Bankruptcy. International Meeting on
Artificial Intelligence in Accounting Finances and Taxes, Huelva, Spain, 85-100.
• Edmister, R.O. (1972). An Empirical Test of Financial Ratio Analysis for Small Business
Failure Prediction. Journal of Financial and Quantitative Analysis. Vol. 7, 1477-1493.
• Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of
Eugenics, 7, 179-188.
• Goldberg, D.E. (1989). Genetic Algorithms in Search, Optimization and Machine Learning.
Addison-Wesley, Reading, MA.
• Ho, T.K., Hull, J.J. and Srihari, S.N. (1994). Decision Combination in Multiple Classifier
Systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 16, no. 1,
January, 66-75.
• Hosmer, D. W. and Lemeshow, S. (1989). Applied Logistic Regression. Wiley Series in
Probability and Mathematical Statistics.
• Hunt, E.B., Marin, J. and Stone P.J. (1966). Experiments in Induction, Academic Press.
• Jensen, F.V. (1996). Introduction to Bayesian networks. University College of London.
• Larrañaga, P., Murga, R., Poza, M. and Kuijpers, C. (1996a). Structure Learning of
Bayesian Networks by Hybrid Genetic Algorithms. Learning from Data: AI and Statistics
V, Lecture Notes in Statistics 112. D. Fisher, H.-J. Lenz (eds.), New York, NY:
Springer-Verlag, 165-174.
• Larrañaga, P., Poza, M., Yurramendi, Y., Murga, R. and Kuijpers, C. (1996b). Structure
Learning of Bayesian Networks by Genetic Algorithms: A Performance Analysis of
Control Parameters. IEEE Transactions on Pattern Analysis and Machine Intelligence,
18, 912-926.
• Larrañaga, P., Kuijpers, C., Murga, R. and Yurramendi, Y. (1996c). Bayesian Network
Structures by searching for the best ordering with genetic algorithms. IEEE Transactions
on System, Man and Cybernetics. Vol. 26, no. 4, 487-492.
• Larrañaga, P., Kuijpers, C., Murga, R., Yurramendi, Y., Graña, M., Lozano, J.A.,
Albizuri, X., D'Anjou, A. and Torrealdea, F.J. (1996d). Genetic Algorithms applied to
Bayesian Networks. A. Gammerman (ed.). Computational Learning and Probabilistic
Reasoning. John Wiley, 211-234.
• Larrañaga, P., Kuijpers, C., Poza, M. and Murga, R. (1997). Decomposing Bayesian
Networks by Genetic Algorithms. Statistics and Computing. No 7, 19-34.
• Lauritzen, S.L. (1996). Graphical models. Oxford Science Publications.
• Lauritzen, S.L. and Spiegelhalter, D.J. (1988). Local computations with probabilities on
graphical structures and their application to Expert Systems. J.R. Statist. Soc. B, vol. 50,
no. 2, 157-224.
• Libby, R. (1975). Accounting Ratios and the Prediction of Failure: Some Behavioral
Evidence. Journal of Accounting Research, 150-161.
• Lizarraga, F. (1996). Modelos Multivariantes de predicción del fracaso empresarial: una
aplicación a la realidad de la información contable española. Ph.D. Public University of
Navarra.
• Mensah, Y. (1983). The Differential Bankruptcy Predictive Ability of Specific Price Level
Adjustments: Some Empirical Evidence, Accounting Review, 228-246.
• Michie, D., Spiegelhalter, D. J. and Taylor, C. C. (1994). Machine Learning, Neural and
Statistical Classification. Ellis Horwood Series in Artificial Intelligence. New York.
• Ohlson, J.A. (1980). Financial Ratios and the Probabilistic Prediction of Bankruptcy. Journal
of Accounting Research, 18, 1, 109-131.
• Patterson, A. and Niblett, T.(1983). ACLS user manual, Intelligent Terminals Ltd.,
Glasgow.
• Peel, M.J. and Peel, D.A. (1987). Some Further Empirical Evidence on Predicting Private
Company Failure. Accounting and Business Research, 18, 69, 57-66.
• Platt, H.D. (1985). Why Companies Fail: Strategies for Detecting, Avoiding and Profiting
from Bankruptcy. Lexington Books. Massachusetts.
• Quinlan, J.R. (1979). Discovering rules by induction from large collections of examples, in
Expert systems in the micro electronic age, Edinburgh University Press.
• Quinlan, J.R. (1986). Induction of Decision Trees, Machine Learning, 1(1), 81-106.
• SAS Institute Inc. (1993). SAS Language: Reference, Version 6, SAS Institute Inc.
• Scott, J. (1981). The probability of Bankruptcy: A Comparison of Empirical Predictions and
Theoretical Models. Journal of Banking and Finance, 317-344.
• Sierra B. and Larrañaga P. (1997). Searching for the optimal Bayesian Network in
classification tasks by Genetic Algorithms, WUPES 97, 144-154.
• SPAD.S (1994). Version 2.5 Sistema Compatible para el Análisis de Datos.
• SPSS Inc. (1990). SPSS Reference Guide, SPSS INC.
• Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal
of the Royal Statistical Society, Series B, 36, 111-147.
• Taffler, R. (1982). Finding those Firms in Danger. Accountancy Age, 16.
• The Turing Institute (1988). The CN2 Induction Algorithm. The Turing Institute, 36 N.
Hanover St., Glasgow, G1 2AD, U.K. October.
• Xu, L., Krzyzak, A. and Suen, C.Y. (1992). Methods of Combining Multiple Classifiers and
Their Applications to Handwriting Recognition. IEEE Transactions on Systems, Man and
Cybernetics, Vol. 22, no. 3, 418-435.
• Zavgren, C. (1985). A Probabilistic Model of Financial Distress. Ph.D. The University of
Nebraska.