ART1197.DOC Document Transcript

  • 1. COMBINING STATISTICAL AND MACHINE LEARNING BASED CLASSIFIERS IN THE PREDICTION OF CORPORATE FAILURE
S. Dizdarevic, P. Larrañaga, B. Sierra, J.A. Lozano, J.M. Peña
Department of Computer Science and Artificial Intelligence, University of the Basque Country
Abstract
This project applies methods from Statistics and from Machine Learning, an area of Artificial Intelligence, to the problem of corporate failure prediction. Five paradigms are compared empirically on a sample of 120 Spanish companies, 60 of which had gone bankrupt and 60 of which had not: Discriminant Analysis, Logistic Regression, Classification Trees, Rule Induction and Bayesian Networks. Two Artificial Intelligence techniques, Voting by Majority Principle and Bayesian Formalism, are implemented in order to improve on the predictions of the single models. The predictor variables are financial ratios that summarize the accounting information of every company over the three years previous to the date of survey.
  • 2. 1. Introduction
Corporate failure prediction, through classification of known cases and generalization to other cases, has been a subject of study for almost 30 years. Accurate prediction of corporate failure is important to investors, creditors and auditors. It can also help shareholders, creditors and governments to avoid heavy losses stemming from unexpected bankruptcies. Using analytic tools and data from corporate financial reports, one can therefore evaluate a company and predict its future financial status. Although the idea of a company going bankrupt is associated with its disappearance, before this actually happens the company has gone through a long period of crisis with several stages. Many authors group them into two main stages, corresponding to the two senses of the concept of failure: economic and financial. Economic failure starts when the profitability of the invested capital falls below its opportunity cost, so that the owner receives an investment yield lower than that of alternative opportunities with the same risk. As the economic failure advances and settles into the company, income starts to fall below expenses and the first negative results appear. The deterioration produced during the economic failure process, if not corrected, will drive the company into technical insolvency. This is the first stage of what is called financial failure. In this situation the company does not have enough liquidity to meet its payments as these increase. The breaking point of this ruinous process is reached when the company is not only unable to pay its debts as they fall due but is also in a situation of negative net worth. This means that its current liabilities are greater than the value of its assets, which could soon lead the company to its disappearance. The study of corporate failure must always be approached through the search for its causes, which can be analyzed through their visible symptoms.
As Argenti (1976) proposes, it is very useful to know the reasons why other companies have failed in order to avoid them in our own. Nevertheless, these causes can only be captured through the discovery of their symptoms. Some of these causes are the following: management faults, deficiencies in the accounting information systems, inability to adapt to changes in the environment, overtrading, the launch of big projects, abuse of debt financing, and the risks of the business world. As for the symptoms, Argenti accepts as such the deterioration suffered by the financial ratios as the company comes closer to failure, and also indicates that accounting manipulation is one clear symptom of failure. According to Platt (1985), the sources that an external economic agent can use to detect the aforementioned symptoms can be grouped into three sets: common sense, the analysis of the statements of accounts published by the companies, and statistical tools.
  • 3. The use of common sense, although a very simple strategy, has the following advantages: it does not need sophisticated computers and there is no need to assimilate financial conditions that are sometimes not easily understandable. All one has to do is pay attention to the daily reality of the company and its environment. Some signs of an approaching crisis are: a change of auditor, development of the relation with the new value, sudden resignations of members of the board, credit lines being reduced or canceled, the sale of stock by the directors, the quotation of the company's stock in the market at prices below its accounting value, excess inventory, … The analysis of the statements of accounts is part of an information process whose aim is to provide data for decision making. The idea of failure, and more precisely the idea of insolvency, has remained connected to the technique of accounting ratios. It was thought that the ratios worsen as the company approaches the crisis process, so that the deterioration suffered by the company could be measured in this way. Because of the great complexity of the information contained in financial statement data, the analysis of financial ratios, which summarize this information, has been the most widely used technique. The great interest in comparisons between different companies (industrial sector, size, …) has also influenced its use. There are two main difficulties related to financial ratios: their creation and their interpretation. A further difficulty is that the same value of a ratio for two companies from different sectors may represent different situations. The financial information gathered in ratios has to be homogenized so that it can be used for the description and prediction of corporate failure. The second task is directly related to the use of Statistics.
Although ignored for half a century by analysts, statistical techniques have nowadays become a commonly used tool because they give objectivity to the analysis. Beaver (1966) was among the pioneers who used them for the analysis of financial ratios in order to predict corporate failure. In his work, starting from 30 ratio variables taken from 79 pairs of companies, failed and non-failed, 6 ratio variables are selected. A profile analysis is based upon them, comparing the means of every ratio in each group, failed and non-failed, and observing the important differences in the five years before the date of failure. Beaver developed a dichotomous classification test for every ratio, using a trial-and-error process that allowed him to choose, for every ratio and every year, the cut-point that minimized the classification errors. Nevertheless, the univariate model of Beaver contrasts with the inherently multivariate character of the documents describing the financial situation. Therefore, in order to make these documents valuable, they have to be interpreted from a perspective that considers the several financial aspects of a company as a whole. The search for this perspective is the reason why several researchers have used multivariate statistical techniques for corporate failure prediction. Altman (1968) was the pioneer in applying Discriminant Analysis to this problem, obtaining surprising results. The linear combination of five ratios in a profile created a score capable of discriminating between “healthy” and “failed” companies with very high percentages of success in the two years previous to the failure. The initial work of Altman was adapted, updated and improved by several researchers. It is worth noting the works of Deakin (1972), Blum (1974), Edmister (1972), Libby (1975), Scott (1981), and Taffler (1982).
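The cut-point search behind a univariate dichotomous test such as Beaver's can be pictured with the short Python sketch below. It is only an illustration under stated assumptions: an exhaustive scan over candidate thresholds stands in for the manual trial and error, a company is assumed to be predicted as failed when its ratio lies below the cut, and the function name and data layout are hypothetical.

```python
from typing import List, Sequence, Tuple


def best_cut_point(values: Sequence[float], failed: Sequence[bool]) -> Tuple[float, int]:
    """Return (cut, errors) for the threshold with the fewest misclassifications.

    Assumption for this sketch: a company is predicted 'failed' when its
    ratio value is strictly below the cut.
    """
    ordered = sorted(set(values))
    # Candidate cuts: one below every value, the midpoints between
    # consecutive distinct values, and one above every value.
    candidates: List[float] = [ordered[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    candidates.append(ordered[-1] + 1.0)

    best_cut, best_errors = candidates[0], len(values) + 1
    for cut in candidates:
        # An error is any company whose prediction differs from its true status.
        errors = sum((v < cut) != f for v, f in zip(values, failed))
        if errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut, best_errors
```

In a study like Beaver's the search would be repeated for every ratio and every year before failure; the sketch handles one ratio in one year.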
  • 4. The need for a statistical alternative that avoids the problems related to Discriminant Analysis led to the use of conditional probability models, logit and probit, which are more flexible in their requirements. Ohlson (1980) is considered the first author to publish a failure prediction model based on conditional probability models. Although his results were not brilliant, his methodology was followed by other authors: Mensah (1983), Zavgren (1985), Casey and Bartczak (1985), and Peel and Peel (1987). This chapter is organized as follows. Section 2 presents the features of the case study used to carry out the empirical comparison among several paradigms coming from Statistics and Artificial Intelligence, and the combining techniques. These paradigms are explained in Section 3. Section 4 shows the results obtained for every method in terms of the percentage of well-classified companies, together with descriptions of the models and an analysis of the results. Section 5 closes with the conclusions of the work and proposals for further research.
2. Problem Description
Starting from the hypothesis that the accounting information patterns of non-failed and failed companies are different, the fundamental aim of this chapter is to show, by means of an example, how to create models able to predict the failure of companies in advance (1 year, 2 years and 3 years). These models can be considered normative systems, as they are founded on probability theory. Following recent research in Artificial Intelligence, two techniques have been implemented and used to integrate the individual models into one, in order to improve on the predictive ability of each of them. This section presents the problem and deals with aspects of it such as the concept of failure, the obtaining and validation of the sample, the selection of the financial ratios with which the models can be constructed, and the sample for multiple models.
A more detailed description of the failure problem can be found in Lizarraga (1996), the work that inspired this project. The data sample used here is the one that Lizarraga gathered from several Provincial Trade Register Offices and used for the empirical comparison in his doctoral dissertation. The procedure by which the sample of 120 companies was selected and formed is as follows. The first methodological problem to solve was determining which concept of failure to use. He chose the concept of suspension of payments, given that it is related not to a specific financial problem but to a situation of profound economic crisis. This concept presents three fundamental advantages: it is objective, it gives a representative date for the moment of failure, and the number of companies that had to resort to it increased greatly in the period of study. Finally, the availability of the annual accounts deposited in the Provincial Trade Register Offices was another aspect that helped Lizarraga to carry out the empirical work of information gathering.
  • 5. The sample was made up of 120 companies: half of them belonged to a group of companies classified as “failed” and the other half were classified as “healthy”, so that both groups could be incorporated into the analysis. The selection was carried out by a matching process: each company in a previously selected list of “failed” companies was matched with a “healthy” company of the same size and industrial sector. This matching process is justified by the convenience of avoiding any possible distortion effect related to size and industrial sector. As access to every Provincial Trade Register Office was not possible, Lizarraga decided to reduce the scope to the 10 provinces with the largest number of records of payment suspension requested during the period of the study. These provinces gathered 63% of the total number of records of payment suspension. The information was gathered through the Official Bulletin of the Trade Register Office. The time interval was 18 months (from January 1993 to July 1994), which can be regarded as representative of a period of severe crisis among companies in Spain. For every company in the study, the economic and financial data corresponding to the three years previous to the end of the study were obtained. Whereas Lizarraga (1996) validated his model using a sample of 44 companies (22 “healthy” and 22 “failed”) different from the ones used to construct the models, in this study the models are created using a sample of 120 companies. A validation of the original model based on 5-fold cross-validation (Stone 1976) has been carried out, as well as another validation methodology which is explained later. Estimates of the goodness of fit of every model, that is, of the probability that the model classifies correctly, are calculated in the following steps. First, the sample is randomly ordered and then partitioned into 5 disjoint subsets.
Secondly, 4 of those subsets are chosen as the training set to obtain a model, which is then tested with the fifth subset. These steps are repeated five times, using a different testing set every time. This yields 5 percentages $\hat{p}_1, \hat{p}_2, \dots, \hat{p}_5$ of well-classified cases, as well as the “destiny” (failed or non-failed) of each one of the 120 companies distributed among the five disjoint test sets. The estimate $\hat{p}$ of the probability with which the model created with the original sample classifies correctly is given by the following formula:
$\hat{p} = \frac{\sum_{i=1}^{5} \hat{p}_i}{5}$
Figure 1 shows graphically the process of estimation of the goodness of fit of the model with 5-fold cross-validation.
  • 6. [Figure 1: each of the five training sets (Training 1 to Training 5) produces a model (model 1 to model 5); each model is tested on the corresponding test set (Test 1 to Test 5), giving $\hat{p}_1, \dots, \hat{p}_5$, which are combined into $\hat{p} = \frac{\sum_{i=1}^{5} \hat{p}_i}{5}$ for the MODEL.]
Figure 1. Process of estimation of the goodness of fit of the model with 5-fold cross-validation
The other approach to model validation presented in this chapter is the following. Using 4 of the 5 disjoint subsets (obtained as in 5-fold cross-validation) as the training set, a model is obtained and later tested on the sample of 120 companies, instead of on the 24 used in the previous approach. The same process of estimation of $\hat{p}$, the probability with which the model created with the original sample classifies correctly, is used as explained before. Figure 2 shows graphically the process of estimation of the goodness of fit of the model with this validation.
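The two validation schemes described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: `Trainer` stands for any hypothetical learning routine that turns a training set into a classifier, and the fold assignment and seed are assumptions of the sketch.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[Sequence[float], int]       # (ratio values, class label)
Model = Callable[[Sequence[float]], int]    # a fitted classifier
Trainer = Callable[[List[Example]], Model]  # hypothetical learning routine


def five_folds(n: int, seed: int = 0) -> List[List[int]]:
    """Randomly order the indices 0..n-1 and split them into 5 disjoint subsets."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return [order[k::5] for k in range(5)]


def accuracy(model: Model, data: Sequence[Example], indices: Sequence[int]) -> float:
    """Fraction of the indexed cases that the model classifies correctly."""
    return sum(model(data[i][0]) == data[i][1] for i in indices) / len(indices)


def estimate_p(train: Trainer, data: Sequence[Example],
               test_on_whole_sample: bool = False, seed: int = 0) -> float:
    """Average the five per-model accuracies.

    test_on_whole_sample=False: each model is tested on its held-out fold
    (5-fold cross-validation).
    test_on_whole_sample=True: each model is tested on all cases
    (the second validation scheme).
    """
    folds = five_folds(len(data), seed)
    p_hats = []
    for k, held_out in enumerate(folds):
        # Training set: the four folds other than fold k.
        training = [data[i] for j, fold in enumerate(folds) if j != k for i in fold]
        model = train(training)
        test_indices = list(range(len(data))) if test_on_whole_sample else held_out
        p_hats.append(accuracy(model, data, test_indices))
    return sum(p_hats) / 5
```

With 120 cases each fold holds 24 companies, so the first scheme tests every model on 24 unseen cases, while the second tests it on all 120, of which 96 were also used for training.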
  • 7. [Figure 2: each of the five training sets (Training 1 to Training 5) produces a model (model 1 to model 5); each model is tested (Test 1 to Test 5), giving $\hat{p}_1, \dots, \hat{p}_5$, which are combined into $\hat{p} = \frac{\sum_{i=1}^{5} \hat{p}_i}{5}$ for the MODEL.]
Figure 2. Process of estimation of the goodness of fit of the model with the second validation
Lizarraga selected the explanatory variables from the certificate of the Trial Balance, from the profit and loss account and from the financial chart. The selection of these financial ratios was based on two main criteria: (i) they were frequently mentioned in the literature on the analysis of financial statements, and (ii) they could be calculated directly or with simple adjustments. Starting from the 50 variables that Lizarraga had obtained in this previous selection, a Principal Components Factorial Analysis was carried out. It reduced them to the 9 ratios that were used. These ratios were:
X1. CURRENT ASSETS / CURRENT LIABILITIES
X2. CURRENT ASSETS / TOTAL ASSETS
X3. NET RESULT / TOTAL ASSETS
X4. EARNINGS BEFORE INTEREST AND TAXES / FINANCIAL CHARGES
X5. OWN FUNDS / TOTAL DEBT
  • 8. X6. SALES / OWN FUNDS
X7. STOCKS / SALES
X8. DEBTORS / SALES
X9. OPERATIVE CASH FLOW / TOTAL ASSETS
X1 - The CURRENT ASSETS / CURRENT LIABILITIES ratio measures the firm’s ability to meet maturing short-term obligations. It is used to measure liquidity, that is, whether the firm has sufficient cash to meet short-term claims.
X2 - The CURRENT ASSETS / TOTAL ASSETS ratio represents the share of current assets (money, bank, quick assets, stocks…) in total assets (current assets + fixed assets).
X3 - The NET RESULT / TOTAL ASSETS ratio is a profitability ratio which explains the extent to which a company earns the highest return possible for the resources used or capital employed, consistent with the desire and ability to assume risk.
X4 - The EARNINGS BEFORE INTEREST AND TAXES / FINANCIAL CHARGES ratio is calculated by dividing the earnings before interest and tax deductions by the financial charges of a firm. In essence, it is a measure of the true productivity of the firm’s charges. Earnings before interest and taxes are calculated as: earnings before taxes + financial charges.
X5 - The OWN FUNDS / TOTAL DEBT ratio is usually used to measure debt. It is calculated by dividing own funds by total debt. Own funds are those financing funds that consist of the capital that the owner deposited at the beginning of the firm’s existence, reserves, amortization, subsidies, ….
X6 - The SALES / OWN FUNDS ratio is calculated by dividing sales by own funds. Sales represent the net income of the firm’s business.
X7 - The STOCKS / SALES ratio is used to observe the rotation of stocks. It is calculated by dividing stocks (merchandise, raw materials, finished and semi-finished products…) by sales. It represents the percentage of stocks in sales.
X8 - The DEBTORS / SALES ratio is used to observe the rotation of current assets. It is calculated by dividing debtors by sales.
X9 - The OPERATIVE CASH FLOW / TOTAL ASSETS ratio is calculated by dividing operative cash flow by total assets.
This last value is calculated with the following formula:
OCF = traditional cash flow (net result + amortization + long-term reserves) - increment of exploitation current assets + decrement of exploitation current liabilities.
The several classification models constructed, described in the following section, use the aforementioned ratios. The sample for the combined classifiers is obtained as follows. Regardless of the internal structure of an individual classifier and the methodology it is based on, a classifier may be regarded as a function that receives an input individual X and outputs a class j, denoted in short by e(X) = j. In the problem of corporate failure there are only two classes, non-failed and failed, represented in our case by class 1 and class 2 respectively. As explained before, an individual X is represented by the set of values of the nine ratio variables. Even though some classifiers may supply extra information, such as the probability with which class j is assigned to individual X, only the class j was considered as output information. The decision to use this approach was based on its generality, as the individual classifiers could
  • 9. be very different from each other in their methodologies and this kind of information is common to all of them. A good summary of existing techniques for combining classification results can be found in Xu et al. (1992) and Ho et al. (1994). Once the result (class j) of every classifier is available, these results are considered as knowledge from different sources and used as input data for the combined classifiers. As explained before, two validation methods are used. Figure 3 shows graphically how the output data of the individual classifiers are integrated when 5-fold cross-validation is used.
                                   PARADIGMS
individuals                 DA    LR    CN2   CART  Bayes
1…24      (test set 1)
25…48     (test set 2)
49…72     (test set 3)
73…96     (test set 4)
97…120    (test set 5)
Figure 3. Process of construction of the sample for combined classifier with 5-fold cross-validation
The element (test set i, paradigm j), where i = 1 … 5 and j = 1 … 5, represents the classes that model i, created with training set i using paradigm j, assigns to each one of the 24 individuals of test set i. Figure 4 shows graphically the same process when the second validation explained before is used.
                                   PARADIGMS
               DA          LR          CN2         CART        Bayes
individuals    M1 … M5     M1 … M5     M1 … M5     M1 … M5     M1 … M5
1
2
:
120
Figure 4. Process of construction of the sample for combined classifier with the second validation
The element (individual i, (paradigm j, model k)), where i = 1 … 120, j = 1 … 5 and k = 1 … 5, represents the class that model k (M k), created with training set k using paradigm j, assigns to individual i.
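The two layouts of Figures 3 and 4 can be illustrated with the Python sketch below. The function names, the `models` dictionary and the fold lists are hypothetical; the plain majority vote at the end is only one simple way of using such a row and is not taken from the paper's description of its combining methods.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence

Classifier = Callable[[Sequence[float]], int]  # e(X) = j, with j in {1, 2}
PARADIGMS = ["DA", "LR", "CN2", "CART", "Bayes"]


def cv_sample(folds: List[List[int]], models: Dict[str, List[Classifier]],
              X: Sequence[Sequence[float]]) -> List[List[int]]:
    """Figure 3 layout: one row per individual, one column per paradigm.

    models[name][k] is assumed to be the model of that paradigm trained
    without fold k, so every individual is classified by the model that
    did not see it during training.
    """
    rows = {}
    for k, fold in enumerate(folds):
        for i in fold:
            rows[i] = [models[name][k](X[i]) for name in PARADIGMS]
    return [rows[i] for i in sorted(rows)]


def full_sample(models: Dict[str, List[Classifier]],
                X: Sequence[Sequence[float]]) -> List[List[int]]:
    """Figure 4 layout: one row per individual, one column per (paradigm, model)."""
    return [[m(x) for name in PARADIGMS for m in models[name]] for x in X]


def majority_vote(row: Sequence[int]) -> int:
    """Most frequent class in a row; with 5 or 25 votes and 2 classes there are no ties."""
    return Counter(row).most_common(1)[0][0]
```

For the 120 companies, `cv_sample` would return a 120 x 5 matrix and `full_sample` a 120 x 25 matrix of class labels.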
  • 10. Methodologies used to integrate the individual classifiers, which use the aforementioned matrices, are described in the following section.
3. Methodologies
3.1 Methodologies - Individual Classifiers
Supervised classification is one of the tasks most frequently carried out by the so-called Intelligent Systems. A large number of paradigms developed either by Statistics (Logistic Regression, Discriminant Analysis, K-N-N) or by Artificial Intelligence (Neural Networks, Rule Induction, Classification Trees, Bayesian Networks) are capable of carrying out classification tasks. The comparison among these paradigms cannot be carried out analytically, so it is usual to compare them empirically (Michie et al. (1994)). Five of these paradigms have been selected for the comparison: Discriminant Analysis, Logistic Regression, Classification Trees, Rule Induction and Bayesian Networks.
3.1.1 DISCRIMINANT ANALYSIS
Discriminant Analysis, introduced by Fisher (1936), is a statistical technique used in Pattern Matching. The main use of Discriminant Analysis is to predict group membership from a set of predictors. Discriminant function analysis consists of finding a transformation which gives the maximum ratio of the difference between a pair of group multivariate means to the multivariate variance within the two groups. Accordingly, the groups are delineated by maximizing the between-group variance while minimizing the within-group variance. The characteristics of the predictors are related to form groups based upon similarities of their distribution in the p-dimensional space, and these groups are then compared to the groups supplied by the user as truth. This enables the user to test the validity of groups based upon actual data, to test groups which have been created, or to put objects into groups. Two things can then be done with Discriminant Analysis (DA): analysis and classification.
Analysis is related to how the material is to be interpreted, i.e., are there any differences between the groups and which variables make these differences? Classification is used to find a set of variables which provide satisfactory discrimination, so that classification equations can be derived, their use checked through hit-rate tables and, if they are good, used to classify new subjects who were not in the original analysis. DA creates a linear discriminant function starting from the predictor variables $x_1, x_2, \dots, x_p$, and the results of this function are used later as a classification criterion. The discriminant function takes this form:
$Z = a_1 x_1 + a_2 x_2 + \dots + a_p x_p$
where $x_j$ are the explanatory variables (j = 1, 2, …, p), $a_j$ are real coefficients estimated by the model (j = 1, 2, …, p), and $Z$ is the discriminant score. The discriminant function is used for classifying new individuals starting from rules in the following way:
  • 11. The score $a_1 x_1^i + a_2 x_2^i + \dots + a_p x_p^i$ of individual i is compared with a cut-off value $c$: the individual is classified as 0 when its score lies on one side of $c$, and as 1 otherwise. The procedure PROC DISCRIMINANT of the SPSS software has been used in this project. The method selected for model construction was MAHAL, and variables were selected using stepwise selection. The selection rule was: maximize the minimum Mahalanobis distance between groups. For each model some of the p variables were selected, and their corresponding coefficients were calculated.
3.1.2 LOGISTIC REGRESSION
Logistic Regression (Hosmer and Lemeshow (1989)) is a method coming from Statistics whose objective is to obtain a functional relationship between a transformation of a qualitative variable, called logit, and p predictor variables which can be either quantitative or qualitative. It is used to develop a model which fits as well as possible while remaining reasonable enough to describe the relation between the result (the dependent variable or response) and the set of independent (or explanatory) variables. The fundamental characteristic of this regression is that the dependent variable is dichotomous. Mathematically, the function used in the logistic distribution is extremely flexible and easy to use. Denoting the dichotomous variable to predict by $Y$ and the p predictor variables by $x_1, \dots, x_p$, the objective is to determine the coefficients $\beta_0, \beta_1, \dots, \beta_p$ that satisfy the logit transformation formula:
$g(X) = \ln \frac{P(Y=1)}{P(Y=0)} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$
The coefficients ($\beta_0$ is the intercept and $\beta_1, \dots, \beta_p$ are the p parameters) are determined by the method of maximum likelihood. These coefficients are interpreted in terms of odds ratios, and variable selection can be carried out by three methods: forward, backward or stepwise. The stepwise method is the most commonly used one. In it, variables are included or excluded according to a statistical criterion. After obtaining g(X), a classification model can be constructed.
The logistic regression model is described by the following formula:
$\pi(x) = \frac{e^{g(X)}}{1 + e^{g(X)}}$
It is used to classify new individuals starting from rules in the following way: “If $\pi(x) > c$ then individual i is classified as 0; otherwise it is classified as 1”. For this work the PROC LOGISTIC procedure of the SAS software has been used for the empirical comparison.
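The logit, the logistic function and the classification rule above can be traced in a few lines of Python. This is a sketch only: the coefficients are supplied by the caller rather than fitted, the default cut-off c = 0.5 is an assumption, and the class labels follow the rule exactly as it is stated in the text.

```python
import math
from typing import Sequence


def logit(x: Sequence[float], beta0: float, beta: Sequence[float]) -> float:
    """g(X) = beta_0 + beta_1 x_1 + ... + beta_p x_p."""
    return beta0 + sum(b * v for b, v in zip(beta, x))


def pi(x: Sequence[float], beta0: float, beta: Sequence[float]) -> float:
    """pi(x) = e^g / (1 + e^g), evaluated in a form that avoids overflow."""
    g = logit(x, beta0, beta)
    if g >= 0:
        return 1.0 / (1.0 + math.exp(-g))  # algebraically equal to e^g / (1 + e^g)
    e = math.exp(g)
    return e / (1.0 + e)


def classify(x: Sequence[float], beta0: float, beta: Sequence[float],
             c: float = 0.5) -> int:
    """Rule from the text: class 0 if pi(x) > c, class 1 otherwise."""
    return 0 if pi(x, beta0, beta) > c else 1
```

In practice the coefficients would come from a maximum likelihood fit such as the one produced by a statistical package.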
  • 12. 3.1.3 RULE INDUCTION
In the task of constructing Expert Systems, systems for inducing concept descriptions from examples have proved useful in easing the bottleneck of knowledge acquisition. One of these induction systems, CN2 (The Turing Institute (1988)), has been used as a representative of the approach called Machine Learning. CN2 was designed for the efficient induction of simple, comprehensible production rules in domains where problems of poor description language and/or noise may be present. CN2 produces an ordered list of if-then rules rather than an unordered set of if-then rules, although the latter can also be produced by changing the evaluation function. The rules induced by CN2 are of the form "if complex then predict class", where complex is a conjunction of attribute tests. In order to classify new examples, CN2 applies an interpretation in which each rule is tried in order until one is found whose conditions are satisfied by the example to be classified. The class prediction of this rule is then assigned as the class of that example. If no induced rule is satisfied, the final default rule assigns the most commonly occurring class in the training data to all new examples. The learning algorithm of CN2 works iteratively, searching in each iteration for a complex that covers a large number of examples of a single class C and few of other classes. Having found a good complex, the examples it covers are removed from the training set and the rule "if complex then predict C" is added to the end of the rule list. This process iterates until no more satisfactory complexes can be found. In this project the CN2 Induction Algorithm developed by The Turing Institute has been used with the option to produce an ordered list of if-then rules (The CN2 Induction Algorithm).
3.1.4 CLASSIFICATION TREES
The Classification and Regression Trees (CART) software was developed by Breiman et al.
(1984). CART is used for classification or regression analysis of large, complex data sets containing many variables, and is used to construct both binary prediction and classification trees by recursive partitioning (or targeted stratification). CART can construct a binary prediction tree with a structure which is easy to understand, interpret, and use. Tree construction starts with a computationally intensive algorithm that recursively searches over all the variables to produce a sequence of optimal binary splits, so that an extremely large tree is grown. A pruning algorithm is then applied which cuts off all branches of the tree that impair the overall accuracy. The result is a parsimonious decision tree: the simplest tree that gives the maximum accuracy. The tree can be inspected and used to modify the staging algorithm. The procedure is repeated until the best set is found. Cross-validation can be performed a number of times with each analysis. An introduction to the Classification Trees paradigm, which is a popular representation of classifiers, follows. The interior nodes of a classification tree are tests applied to instances during classification. Branches from an interior node correspond to the possible test outcomes. Classification begins with the application of the root node test, its outcome determining the
  • 13. branch to a succeeding node. The process is recursively applied until a leaf node is reached. Then the instance is labeled with the class of the leaf node, and the process halts. The trees are constructed beginning with the root of the tree and proceeding down to its leaves. The family's palindrome name (TDIDT) emphasizes that its members carry out the top-down induction of decision trees. The patriarch of this family is Hunt's concept learning system (CLS) framework (Hunt et al. (1966)). ID3 (Quinlan (1979) and Quinlan (1986)), one of the most famous TDIDT systems, was developed from CLS and uses a selection measure based on the computation of an information gain for each variable; the variable that maximizes this gain is selected. A notable disadvantage of this measure is that it is biased towards selecting variables with many different values. ACLS (Patterson et al. (1983)) and ASSISTANT (Cestnik et al. (1987)) acknowledge ID3 (Quinlan (1979)) as their direct ancestor. ASSISTANT has been used in several medical domains with promising results. In our example the statistical software SPAD.S has been used, in which the approach proposed by Breiman et al. (1984) is implemented. The procedure used for the construction of the classification tree was DISAR. This procedure constructs complete binary decision trees for the discrimination of 2 groups, categorized by the modalities of one nominal variable.
3.1.5 BAYESIAN NETWORKS
While the first attempts at building Expert Systems regarded probabilities as an underlying formalism, the large number of parameters to estimate ($2^n - 1$ for the case of n dichotomous variables) forced researchers to reject this massive approach and instead adopt probabilistic models based on the hypothesis of independence among variables. These models had the advantage of simplicity, but they were incapable of giving good solutions in problems with a high degree of interdependence among variables.
As a consequence of these extreme positions, probability was not taken into account in the building of Expert Systems until the late 80s, when Lauritzen and Spiegelhalter (1988) developed an algorithm for the propagation of evidence. This algorithm allowed probabilistic reasoning within graphical models which represent conditional independence among the variables of the system. Excellent introductions to the Bayesian Network paradigm can be found in Castillo et al. (1997) and Jensen (1996). The structural learning methods can be classified into two main groups: (i) methods based on hypothesis tests, which check the assumption of conditional independence among variables and create a Bayesian Network structure that represents the independences obtained; (ii) methods based on score and search, in which a measure of the goodness of fit (likelihood, entropy, percentage of well-classified cases) of a particular structure is defined, together with a searching procedure over the space of all possible Bayesian Network structures. This searching procedure usually works like a greedy algorithm. In our case we used Genetic Algorithms as an intelligent metaheuristic in the searching process. Genetic Algorithms (Goldberg (1989)) use a direct analogy with natural behavior. They work with a population of individuals, each individual representing a feasible solution to
  • 14. a given problem. To each individual we assign a value or score according to the goodness of that solution represented by that individual. The better the adaptation of the individual to the problem, the more probable is that the individual will be selected for reproduction, crossing its genetic material with other individual selected in the same way. This crossover will produce new individuals - offspring of the previous - which share some of the features of their parents. In this way a new population of feasible solutions is produced, replacing the previous one and verifying the interesting property of having a greater proportion of good features than the previous population. Thus, through these generations, good features are propagated through the population. Favoring the cross of the fittest individuals the most promising areas of the search space will be explored. Figure 5 summarizes the pseudo-code for the so-called Abstract Genetic Algorithm. In it, the parent selection doesn't need to be made by assigning to each individual a value proportional to its objective function, as is usual in the so-called Simple Genetic Algorithm. This selection can be carried out by any function that selects parents in a natural way. It is worth noticing that descendants are not necessarily the next generation of individuals, but that this generation is made up of the union of parents and descendants. That is why the operations of extension and reduction in the cycle are required. begin AGA Make initial population at random WHILE NOT stop DO BEGIN Select parents from the population Produce children from the selected parents Mutate the individuals Extend the population by adding the children to it Reduce the extended population END Output the best individual found end AGA Figure 5. The pseudo-code of the Abstract Genetic Algorithm The individuals of the Genetic Algorithm will be Bayesian Network structures. 
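The cycle of Figure 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the individuals are plain bit lists, parents are chosen by binary tournament, children come from one-point crossover, mutation is applied to the children only, the reduction step keeps the fittest individuals, and the stop criterion is a fixed number of generations. None of these choices, nor the name `abstract_ga` and its parameters, are taken from the paper.

```python
import random


def abstract_ga(fitness, n_bits, pop_size=20, n_children=20,
                p_mut=0.05, generations=50, seed=0):
    """Illustrative Abstract Genetic Algorithm over bit lists (n_bits >= 2)."""
    rng = random.Random(seed)
    # Make initial population at random.
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    # Stop criterion (assumption): a fixed number of generations.
    for _ in range(generations):
        children = []
        for _ in range(n_children):
            # Select parents: binary tournament (any selection rule would do).
            mother = max(rng.sample(population, 2), key=fitness)
            father = max(rng.sample(population, 2), key=fitness)
            # Produce a child by one-point crossover.
            cut = rng.randint(1, n_bits - 1)
            child = mother[:cut] + father[cut:]
            # Mutate: flip each bit independently with probability p_mut.
            child = [1 - b if rng.random() < p_mut else b for b in child]
            children.append(child)
        # Extend the population with the children ...
        extended = population + children
        # ... and reduce it back to pop_size, keeping the fittest individuals.
        extended.sort(key=fitness, reverse=True)
        population = extended[:pop_size]
    # Output the best individual found.
    return max(population, key=fitness)


if __name__ == "__main__":
    # Toy objective (hypothetical): maximize the number of ones.
    print(abstract_ga(sum, n_bits=12))
```

Because the reduction step here is elitist, the best score in the population never decreases from one generation to the next; other reduction rules are equally compatible with the abstract scheme.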
A Bayesian Network structure, for a fixed domain with n variables, can be represented by an n x n connectivity matrix C, whose elements $c_{ij}$ are defined as:

$$c_{ij} = \begin{cases} 1 & \text{if } j \text{ is a parent of } i, \\ 0 & \text{otherwise.} \end{cases}$$

In this genetic approach, an individual of the population is represented by the string:

$$c_{11} c_{21} \ldots c_{n1} \; c_{12} c_{22} \ldots c_{n2} \; \ldots \; c_{1n} c_{2n} \ldots c_{nn}$$

When a total order among the variables is assumed (Larrañaga et al. (1996a)), the usual genetic operators are closed with respect to the DAG conditions. In the more general case, in which no order among the variables is assumed (Larrañaga et al. (1996b)), the usual genetic operators are not closed, and a repair operator is introduced to ensure closure. The objective of this repair operator is to transform the child structures that do not verify the DAG conditions into DAGs, by randomly eliminating the edges that invalidate the DAG conditions. As can be seen in the latter reference, this approach has yielded good results in the reconstruction of the ALARM network. Although in the corporate failure problem the cardinality of the search space is not too large, the same approach has been used. See Larrañaga et al. (1996c) for another approach to the structural learning of Bayesian Networks, in which an individual is a cyclic permutation of the n variables considered and the Genetic Algorithm uses crossover and mutation operators developed for the Traveling Salesman Problem.

Genetic Algorithms have been used as optimizers in other combinatorial problems that arise in the Bayesian Network context. For example, in Larrañaga et al. (1997) they are used to obtain good decompositions of the moral graph associated with the propagation algorithm proposed by Lauritzen and Spiegelhalter. Finally, Larrañaga et al. (1996d) handle the problem of fusing Bayesian Networks proposed by different authors in search of a consensus network.

Taking into account that in a Bayesian Network any variable is influenced only by its Markov Blanket, that is, its parents, its children and the other parents of its children, it seems intuitive to search in the set of structures that are Markov Blankets of the special variable (Sierra and Larrañaga (1997)). The individuals of the Genetic Algorithm are now Markov Blankets of the variable to be classified. One operator has been introduced that guarantees that the children obtained are Markov Blankets of the variable to be classified. The score used to search for the best Markov Blanket is the percentage of well-classified individuals obtained by applying the evidence propagation feature of the HUGIN software (Andersen et al. (1989)).
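The string encoding, the repair operator and the Markov Blanket described above can be illustrated with the following Python sketch. It assumes the convention stated earlier, c[i][j] = 1 when j is a parent of i, with variables numbered from 0. The function names are hypothetical, and the repair loop (find any directed cycle, delete one of its edges at random, repeat) is only one plausible reading of "randomly eliminating the edges that invalidate the DAG conditions", not the authors' implementation.

```python
import random


def to_string(c):
    """Flatten C column by column: c11 c21 ... cn1 c12 ... cnn."""
    n = len(c)
    return [c[i][j] for j in range(n) for i in range(n)]


def from_string(bits, n):
    """Rebuild the n x n connectivity matrix from the flat string."""
    return [[bits[j * n + i] for j in range(n)] for i in range(n)]


def find_cycle(c):
    """Return the (parent, child) edges of one directed cycle, or None."""
    n = len(c)
    state = [0] * n  # 0 = unvisited, 1 = on the current path, 2 = finished
    path = []

    def visit(u):
        state[u] = 1
        path.append(u)
        for v in range(n):
            if c[v][u]:  # u is a parent of v, i.e. edge u -> v
                if state[v] == 1:  # back edge: v is already on the path
                    cycle = path[path.index(v):] + [v]
                    return list(zip(cycle, cycle[1:]))
                if state[v] == 0:
                    found = visit(v)
                    if found:
                        return found
        state[u] = 2
        path.pop()
        return None

    for start in range(n):
        if state[start] == 0:
            found = visit(start)
            if found:
                return found
    return None


def repair(c, rng):
    """Delete randomly chosen cycle edges until the structure is a DAG."""
    c = [row[:] for row in c]  # work on a copy
    while True:
        cycle = find_cycle(c)
        if cycle is None:
            return c
        parent, child = rng.choice(cycle)
        c[child][parent] = 0


def markov_blanket(c, x):
    """Parents, children and the other parents of the children of x."""
    n = len(c)
    parents = {j for j in range(n) if c[x][j]}
    children = {i for i in range(n) if c[i][x]}
    spouses = {j for i in children for j in range(n) if c[i][j]}
    return (parents | children | spouses) - {x}


if __name__ == "__main__":
    cyclic = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]  # 0 -> 1 -> 2 -> 0
    print(to_string(cyclic))
    print(repair(cyclic, random.Random(1)))
```

Each pass of the repair loop removes one edge of an existing cycle, so the loop terminates after at most as many passes as there are edges.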
3.2 Methodologies - Combining Classifiers

Recently, in the area of Artificial Intelligence, combining classifiers has been proposed as a new direction for improving the performance of individual classifiers. These classifiers may be based on a variety of classification methodologies and may achieve different rates of correctly classified individuals. The goal of algorithms that integrate classification results is to generate more certain, precise and accurate system results. As explained in Xu et al. (1992), two problems must be solved when combining classifiers. The first is how many classifiers to choose and of what kind they should be. The second is to determine an adequate technique for combining the results of the chosen classifiers so that a better result can be obtained.

In this work five individual classifiers are chosen, coming from two different areas, Statistics and Machine Learning. To tackle the second problem a large variety of combining techniques was consulted, and finally two of the approaches proposed in Xu et al. (1992) were chosen: the Voting by Majority Principle and the Bayesian Formalism. The decision to use these approaches was influenced by their generality. They can integrate classifiers based on different methodologies, provided that each supplies output information at least at the abstract level (class j), which is the only requirement these techniques have.
The following are the two techniques used for the combination of individual classifiers. The combination is based on output information at the abstract level. The notation that will be used to explain their characteristics is the following: “Given K individual classifiers $e_k$, $k = 1, \ldots, K$, each of which assigns to the input X a label $j_k$, i.e. produces an event $e_k(X) = j_k$, the problem is to use these events to build an integrated classifier E, which gives X one definitive label j, i.e. $E(X) = j$, with $j \in \Lambda \cup \{M+1\}$, where $\Lambda = \{1, 2, \ldots, M\}$ represents a set of specified patterns called classes and $\{M+1\}$ denotes that e has no idea which class X comes from, or in other words, X is rejected by e.” (e.g. M=2 for the corporate failure problem, and K=5 in this work: DA, LR, CN2, CART, and BAYES).

3.2.1 VOTING BY MAJORITY PRINCIPLE

As indicated before, the problem is to produce a new event $E(X) = j$ from the given events $e_k(X) = j_k$, $k = 1, \ldots, K$, where the following equality does not necessarily hold:

$$e_1(X) = e_2(X) = \ldots = e_K(X).$$

That is, conflicts may exist among the decisions of the K classifiers. A simple and common rule for solving this kind of conflict in human social life is the Voting By Majority Principle. If the majority of the K classifiers vote for the same label, the combined classifier takes this label as the final result, where majority means that more than half of the classifiers $e_k$ vote for the same label. The most general majority voting rule is the following:

$$E(X) = \begin{cases} j, & \text{if } T_E(X \in j) = \max_i T_E(X \in i) > \dfrac{K}{2} \\ M+1, & \text{otherwise} \end{cases}$$

where

$$T_E(X \in i) = \sum_{k=1}^{K} T_k(X \in i), \quad i = 1, \ldots, M$$

and

$$T_k(X \in i) = \begin{cases} 1, & \text{if } e_k(X) = i \text{ and } i \in \Lambda \\ 0, & \text{otherwise} \end{cases}$$

that is, the combined classifier E decides that X comes from class j if the majority of the K classifiers decide that X comes from class j; otherwise it rejects X.
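The majority rule above can be written directly in code. The sketch below is an illustration only: the function name `majority_vote` is hypothetical, classes are numbered 1..M, and any other label (including M+1) is treated as a rejection that counts towards K but supports no class.

```python
from collections import Counter


def majority_vote(labels, n_classes):
    """Return the label voted by more than K/2 classifiers, else M+1 (reject)."""
    k = len(labels)          # K: number of individual classifiers
    reject = n_classes + 1   # the extra label M+1
    # T_E(X in i): number of classifiers whose output is class i.
    tally = Counter(label for label in labels if 1 <= label <= n_classes)
    if not tally:
        return reject
    label, votes = tally.most_common(1)[0]
    # A strict majority is unique, so ties in the tally never pass this test.
    return label if votes > k / 2 else reject


if __name__ == "__main__":
    # Five classifiers, two classes (1 and 2), as in the corporate failure problem.
    print(majority_vote([1, 2, 1, 1, 2], n_classes=2))
```

With K = 5 and M = 2, and no classifier allowed to reject, one of the two classes always collects at least three votes, so the reject branch is never taken in that setting.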
It is easy to see that classifiers that reject X have no impact on the combined classifier E, unless the majority of them reject X, or K is even and half of them vote for one class and half for another. In this project there are five individual classifiers and none of them can declare the class of an individual X unknown, so the integrated classifier E always knows which class to assign to X; in other words, no X is rejected by E.

3.2.2 THE BAYESIAN FORMALISM

The previous voting method combines the results of the individual classifiers using only the label output by each classifier (i.e., the event $e_k(X) = j_k$). Each $e_k(X) = j_k$ is treated equally as one vote, without considering the error of $e_k$ itself. The Bayesian Formalism takes these errors into consideration, representing each classifier's error by a confusion matrix given by:

$$PT_k = \begin{pmatrix} n_{11}^{(k)} & n_{12}^{(k)} & \cdots & n_{1M}^{(k)} \\ n_{21}^{(k)} & n_{22}^{(k)} & \cdots & n_{2M}^{(k)} \\ \vdots & \vdots & \ddots & \vdots \\ n_{M1}^{(k)} & n_{M2}^{(k)} & \cdots & n_{MM}^{(k)} \end{pmatrix} \quad \text{for } k = 1, 2, \ldots, K \tag{1}$$

where each row i corresponds to class i and each column j corresponds to the event $e_k(X) = j$. Thus, an element $n_{ij}^{(k)}$ denotes that $n_{ij}^{(k)}$ samples of class i have been assigned label j by $e_k$. It follows from (1) that the size of the sample is:

$$N^{(k)} = \sum_{i=1}^{M} \sum_{j=1}^{M} n_{ij}^{(k)} \tag{2}$$

in which the number of cases in each class i is:

$$n_{i\bullet}^{(k)} = \sum_{j=1}^{M} n_{ij}^{(k)}, \quad i = 1, \ldots, M \tag{3}$$

and the number of cases that are assigned class j by $e_k$ is:

$$n_{\bullet j}^{(k)} = \sum_{i=1}^{M} n_{ij}^{(k)}, \quad j = 1, \ldots, M. \tag{4}$$

For an event $e_k(X) = j$ of an error-bearing classifier $e_k$, its truth (i.e., that X comes from class j) is uncertain. With the knowledge of its confusion matrix $PT_k$, this uncertainty can be described by the conditional probabilities that the propositions $X \in i$, $i = 1, \ldots, M$, are true given the occurrence of the event $e_k(X) = j$, that is:

$$P(X \in i \mid e_k(X) = j) = \frac{n_{ij}^{(k)}}{n_{\bullet j}^{(k)}} = \frac{n_{ij}^{(k)}}{\sum_{i=1}^{M} n_{ij}^{(k)}}. \tag{5}$$

From another point of view, the confusion matrix $PT_k$ can be regarded as the prior knowledge of an expert. The expert can express his belief in a proposition, given the event $e_k(X) = j$, by a real number bel(.), called the belief value. The higher the bel(.) he gives to a proposition, the more likely it is to be true. With the knowledge of $PT_k$, bel(.) for the proposition $X \in i$ is given in the form of the conditional probability (5):

$$bel(X \in i \mid e_k(X) = j) = P(X \in i \mid e_k(X) = j) = \frac{n_{ij}^{(k)}}{n_{\bullet j}^{(k)}} = \frac{n_{ij}^{(k)}}{\sum_{i=1}^{M} n_{ij}^{(k)}}. \tag{6}$$
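Equations (4) to (6), together with the product rule and decision rule that the following paragraphs state as (7) and (8), can be sketched in Python as follows. The confusion matrices in the usage example are made-up numbers, not results from this study, and the function names are hypothetical. Two choices are assumptions of this sketch rather than part of the formalism as stated: a column with no cases is given a uniform belief, and a tie for the maximum belief is treated as a rejection.

```python
def belief_table(confusion):
    """Eqs. (5)-(6): table[i][j] = P(X in class i+1 | e_k(X) = j+1)."""
    m = len(confusion)
    # Eq. (4): number of cases that the classifier assigned to each label j.
    col_totals = [sum(confusion[i][j] for i in range(m)) for j in range(m)]
    return [[confusion[i][j] / col_totals[j] if col_totals[j] else 1.0 / m
             for j in range(m)]   # empty column: uniform belief (assumption)
            for i in range(m)]


def combine(confusions, labels):
    """Combine K labels (1..M) into normalized beliefs and a final decision."""
    m = len(confusions[0])
    bel = [1.0] * m
    for confusion, j in zip(confusions, labels):
        table = belief_table(confusion)
        for i in range(m):
            # Product over classifiers of P(X in i | e_k(X) = j_k).
            bel[i] *= table[i][j - 1]
    total = sum(bel)
    if total == 0:
        return bel, m + 1                # no class is supported: reject
    bel = [b / total for b in bel]       # normalize so the beliefs sum to 1
    best = max(bel)
    winners = [i + 1 for i, b in enumerate(bel) if b == best]
    # Ties are rejected with label M+1 (assumption of this sketch).
    return bel, winners[0] if len(winners) == 1 else m + 1


if __name__ == "__main__":
    # Hypothetical confusion matrices for two classifiers and two classes.
    pt1 = [[50, 10], [20, 40]]
    pt2 = [[30, 30], [10, 50]]
    print(combine([pt1, pt2], labels=[1, 2]))
```

In the usage example the first classifier says class 1 and the second says class 2; the combined belief favors class 1 because the first classifier's "class 1" output is more reliable than the second classifier's "class 2" output.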
With K classifiers $e_1, e_2, \ldots, e_K$, there will be K matrices $PT_1, \ldots, PT_K$. When these classifiers are used on the same input X, K events $e_k(X) = j_k$, $k = 1, \ldots, K$, will happen. Each $e_k(X) = j_k$ and its corresponding $PT_k$ supply a set of beliefs $bel(X \in i \mid e_k(X) = j_k)$, $i = 1, \ldots, M$. Now, the question is how to integrate these individual supports to give the combined value bel(i). Xu et al. (1992) deduce that bel(i) can be calculated as follows:

$$bel(i) = \eta \prod_{k=1}^{K} P(X \in i \mid e_k(X) = j_k) \tag{7}$$

with $\eta$ a constant that ensures that $\sum_{i=1}^{M} bel(i) = 1$ (since the propositions $X \in i$, $i = 1, \ldots, M$, are mutually exclusive and exhaustive). That is:

$$\frac{1}{\eta} = \sum_{i=1}^{M} \prod_{k=1}^{K} P(X \in i \mid e_k(X) = j_k). \tag{8}$$

Finally, depending on these bel(i) values, the integrated classifier E can classify X into a class according to the decision rule given by:

$$E(X) = \begin{cases} j, & \text{if } bel(j) = \max_i bel(i) \\ M+1, & \text{otherwise} \end{cases}$$

that is, E decides that X comes from class j if the belief value bel(j) is greater than any other belief bel(i), $i = 1, \ldots, M$; otherwise it rejects X, or in other words E has no idea which class X comes from.

4. Results

In this section the results of applying the methods described in the previous section to the problem of corporate failure are explained in full detail. The models for each one of the five paradigms used in this project, and for each period of time before failure (1 year, 2 years and 3 years), are shown in this section. The results are given as the percentage of well-classified companies for each paradigm and period.

Discriminant Analysis (DA)

Experiments with this paradigm were carried out using the SPSS software. The following are the classifiers obtained with DA, from a sample of 120 companies, for each one of the 3 consecutive years before the failure.

Three years before the failure

S(F) = -2.0081765 + X4*2.2476778 + X5*2.529652
Scores for the centers of groups are:
- class 1 = -0.33467
- class 2 = 0.33467

X4 = Earnings before interests and taxes / Financial charges
X5 = Own funds / Total debt
S(F): Score of Failure, where: “if S(F) > 0 then the company is classified as failed, otherwise as non-failed”

Two years before the failure

S(F) = -1.4674983 + X2*2.3656861 + X3*-8.1494602 + X5*3.1317845

Scores for the centers of groups are:
- class 1 = -0.63454
- class 2 = 0.63454

X2 = Current assets / Total assets
X3 = Net result / Total assets
X5 = Own funds / Total debt
X7 = Stocks / Sales
S(F): Score of Failure, where: “if S(F) > 0 then the company is classified as failed, otherwise as non-failed”

One year before the failure

S(F) = -0.5673571 + X3*9.364902 + X5*0.5510232

Scores for the centers of groups are:
- class 1 = -0.75698
- class 2 = 0.75698

X3 = Net result / Total assets
X5 = Own funds / Total debt
S(F): Score of Failure, where: “if S(F) > 0 then the company is classified as failed, otherwise as non-failed”

Logistic Regression (LR)
Experiments with this paradigm were carried out using the SAS software. The following are the classifiers obtained with LR, from a sample of 120 companies, for each one of the 3 consecutive years before the failure. The following probability function:

$$P(F) = \frac{e^{g(X)}}{1 + e^{g(X)}}$$

is the same for each one of the three models, where P(F) is the Failure Probability and “if P(F) > 0.5 then the company is classified as failed, otherwise as non-failed”.

Three years before the failure

g(X) = -1.3287 + X4*1.4877 + X5*1.6896

X4 = Earnings before interests and taxes / Financial charges
X5 = Own funds / Total debt

Two years before the failure

g(X) = -0.4999 + X3*-13.5892 + X5*3.8058

X3 = Net result / Total assets
X5 = Own funds / Total debt

One year before the failure

g(X) = -1.2854 + X3*26.1304 + X5*1.3535

X3 = Net result / Total assets
X5 = Own funds / Total debt

Rule Induction (CN2)

Experiments with this paradigm were carried out using the CN2 algorithm. The following are the classifiers obtained with the CN2 algorithm, using a sample of 120 companies as a training set, for each one of the 3 consecutive years before the failure. The models are represented as ordered lists of if-then rules, as can be seen in Figures 10, 11 and 12.

Three years before the failure
if ((x4 < 0.70) && (x8 < 0.11)) then X=NON-FAILED
else if ((x2 > 0.52) && (x3 > 0.01) && (x5 > 0.68) && (x9 > 0.04)) then X=FAILED
else if ((x2 > 0.32) && (x4 > 0.76) && (x9 < 0.15)) then X=FAILED
else if ((x1 > 0.06) && (x2 < 0.51) && (x8 > 0.17)) then X=NON-FAILED
else if ((x1 > 0.16) && (x5 < 0.56) && (x7 < 0.09) && (x8 > 0.19)) then X=FAILED
else if ((x2 > 0.88) && (x6 > 0.22)) then X=NON-FAILED
else if ((x2 < 0.87) && (x6 < 0.19) && (x7 > 0.14)) then X=NON-FAILED
else if ((x4 < 0.39) && (x6 < 0.26)) then X=FAILED
else if ((x1 < 0.98) && (x6 < 0.41)) then X=NON-FAILED
else if ((x2 > 0.74) && (x3 > 0.02)) then X=FAILED
else if ((x4 < 0.61) && (x9 > 0.19)) then X=NON-FAILED
else if ((x3 < 0.25) && (x9 > 0.08)) then X=FAILED
else if ((x1 < 0.90) && (x4 > 0.08) && (x7 > 0.05)) then X=NON-FAILED
else X=FAILED

Figure 10. Rule Induction model for three years before the failure

Two years before the failure

if ((x4 > 0.17) && (x5 > 0.50) && (x8 > 0.19)) then X=FAILED
else if ((x3 > 0.09) && (x8 > 0.20)) then X=NON-FAILED
else if ((x1 > 0.06) && (x7 > 0.30) && (x7 < 0.44)) then X=NON-FAILED
else if ((x1 > 0.02) && (x6 > 0.75) && (x9 < 0.19)) then X=FAILED
else if ((x1 > 0.25) && (x4 > 0.06) && (x9 < 0.05)) then X=NON-FAILED
else if ((x1 > 0.11) && (x2 > 0.74) && (x9 < 0.24)) then X=FAILED
else if ((x4 > 0.08) && (x7 > 0.22) && (x7 < 0.34)) then X=NON-FAILED
else if ((x1 < 0.72) && (x2 > 0.48) && (x2 < 0.71) && (x3 < 0.10)) then X=FAILED
else if ((x3 < 0.10) && (x6 > 0.10)) then X=NON-FAILED
else if (x2 > 0.69) then X=FAILED
else X=NON-FAILED

Figure 11. Rule Induction model for two years before the failure

One year before the failure

if (x4 < 0.80) then X=NON-FAILED
else if ((x1 > 1.41) && (x4 > 0.87) && (x7 < 0.36)) then X=FAILED
else if ((x4 > 0.95) && (x5 < 0.22)) then X=NON-FAILED
else if ((x5 < 0.36) && (x8 > 0.03)) then X=FAILED
else if ((x4 < 1.90) && (x6 > 4.69)) then X=NON-FAILED
else if ((x1 > 0.81) && (x1 < 1.38) && (x7 < 0.23)) then X=FAILED
else if ((x8 > 0.17) && (x7 < 0.67)) then X=NON-FAILED
else X=FAILED

Figure 12. Rule Induction model for one year before the failure

The failure probability is always 100%, whether the company is classified as failed or as non-failed.

For the Classification Trees and Bayesian Networks paradigms, the variables have been discretized into three categories. The process was carried out as follows for each one of the 3 years. For the group of 60 healthy companies and the group of 60 failed companies, the mean of each one of the nine ratio-variables was calculated, in order to replace the missing values that the original sample contained. Then, for each ratio-variable, the distribution was analyzed using the FREQUENCIES and DESCRIPTIVES procedures of the SPSS software. Finally, the values of each ratio-variable were separated into three categories (1, 2 and 3) according to their cumulative percentages. The next figure shows the process of categorization.
"3 years before the failure"
x1 (lowest through 0.30=1) (0.31 through 0.66=2) (0.67 through highest=3)
x2 (lowest through 0.63=1) (0.64 through 0.77=2) (0.78 through highest=3)
x3 (lowest through 0.02=1) (0.03 through 0.06=2) (0.07 through highest=3)
x4 (lowest through 0.21=1) (0.22 through 0.61=2) (0.62 through highest=3)
x5 (lowest through 0.24=1) (0.25 through 0.51=2) (0.52 through highest=3)
x6 (lowest through 0.32=1) (0.33 through 0.68=2) (0.69 through highest=3)
x7 (lowest through 0.11=1) (0.12 through 0.22=2) (0.23 through highest=3)
x8 (lowest through 0.21=1) (0.22 through 0.32=2) (0.33 through highest=3)
x9 (lowest through 0.07=1) (0.08 through 0.16=2) (0.17 through highest=3)

"2 years before the failure"
x1 (lowest through 0.26=1) (0.27 through 0.62=2) (0.63 through highest=3)
x2 (lowest through 0.61=1) (0.62 through 0.74=2) (0.75 through highest=3)
x3 (lowest through 0.02=1) (0.03 through 0.06=2) (0.07 through highest=3)
x4 (lowest through 0.24=1) (0.25 through 0.49=2) (0.50 through highest=3)
x5 (lowest through 0.20=1) (0.21 through 0.43=2) (0.44 through highest=3)
x6 (lowest through 0.31=1) (0.32 through 0.59=2) (0.60 through highest=3)
x7 (lowest through 0.13=1) (0.14 through 0.24=2) (0.25 through highest=3)
x8 (lowest through 0.24=1) (0.25 through 0.36=2) (0.37 through highest=3)
x9 (lowest through 0.05=1) (0.06 through 0.12=2) (0.13 through highest=3)

"1 year before the failure"
x1 (lowest through 1.05=1) (1.06 through 1.46=2) (1.47 through highest=3)
x2 (lowest through 0.63=1) (0.64 through 0.77=2) (0.78 through highest=3)
x3 (lowest through -0.01=1) (0.0 through 0.04=2) (0.05 through highest=3)
x4 (lowest through 0.89=1) (0.90 through 1.89=2) (1.90 through highest=3)
x5 (lowest through 0.32=1) (0.33 through 0.85=2) (0.86 through highest=3)
x6 (lowest through 2.89=1) (2.90 through 5.89=2) (5.90 through highest=3)
x7 (lowest through 0.11=1) (0.12 through 0.22=2) (0.23 through highest=3)
x8 (lowest through 0.21=1) (0.22 through 0.32=2) (0.33 through highest=3)
x9 (lowest through -0.05=1) (-0.04 through 0.05=2) (0.06 through highest=3)

Figure 6. The process of categorization of ratio variables

Classification Trees (CART)

To carry out experiments with this paradigm the SPAD.N and SPAD.S software have been used. The following are the classifiers obtained with the procedure DISAR, using a sample of 120 companies as a training set, for each one of the 3 consecutive years before the failure. The models that represent these classifiers are the binary classification trees shown in Figures 7, 8 and 9. Each node is listed with the branch condition that leads to it, its size, and its numbers of healthy (H) and failed (F) companies; indentation shows depth.

Three years before the failure

Node 1: Size = 120, H=60 and F=60
  [X7 /= 1] Node 2: Size = 84, H=48 and F=36
    [X5 = 3] Node 4: Size = 26, H=11 and F=15
      [X1 /= 1] Node 8: Size = 16, H=4 and F=12
        [X6 = 3] Node 16: Size = 7, H=0 and F=7, FINAL NODE
        [X6 /= 3] Node 17: Size = 9, H=4 and F=5
          [X8 /= 2] Node 34: Size = 7, H=2 and F=5
            [X3 /= 2] Node 68: Size = 4, H=0 and F=4, FINAL NODE
            [X3 = 2] Node 69: Size = 3, H=2 and F=1, FINAL NODE
          [X8 = 2] Node 35: Size = 2, H=2 and F=0, FINAL NODE
      [X1 = 1] Node 9: Size = 10, H=7 and F=3
        [X2 /= 1] Node 18: Size = 6, H=3 and F=3
          [X8 /= 1] Node 36: Size = 4, H=3 and F=1, FINAL NODE
          [X8 = 1] Node 37: Size = 2, H=0 and F=2, FINAL NODE
        [X2 = 1] Node 19: Size = 4, H=4 and F=0, FINAL NODE
    [X5 /= 3] Node 5: Size = 58, H=37 and F=21
      [X1 /= 2] Node 10: Size = 43, H=25 and F=18
        [X5 = 2] Node 20: Size = 21, H=9 and F=12
          [X7 /= 3] Node 40: Size = 9, H=1 and F=8
            [X3 = 3] Node 80: Size = 1, H=1 and F=0, FINAL NODE
            [X3 /= 3] Node 81: Size = 8, H=0 and F=8, FINAL NODE
          [X7 = 3] Node 41: Size = 12, H=8 and F=4
            [X3 = 3] Node 82: Size = 2, H=0 and F=2, FINAL NODE
            [X3 /= 3] Node 83: Size = 10, H=8 and F=2, FINAL NODE
        [X5 /= 2] Node 21: Size = 22, H=16 and F=6
          [X3 = 3] Node 42: Size = 7, H=7 and F=0, FINAL NODE
          [X3 /= 3] Node 43: Size = 15, H=9 and F=6
            [X2 = 3] Node 86: Size = 8, H=6 and F=2
              [X6 /= 2] Node 172: Size = 7, H=6 and F=1, FINAL NODE
              [X6 = 2] Node 173: Size = 1, H=0 and F=1, FINAL NODE
            [X2 /= 3] Node 87: Size = 7, H=3 and F=4
              [X1 = 3] Node 174: Size = 4, H=3 and F=1, FINAL NODE
              [X1 /= 3] Node 175: Size = 3, H=0 and F=3, FINAL NODE
      [X1 = 2] Node 11: Size = 15, H=12 and F=3, FINAL NODE
  [X7 = 1] Node 3: Size = 36, H=12 and F=24
    [X4 = 3] Node 6: Size = 13, H=2 and F=11, FINAL NODE
    [X4 /= 3] Node 7: Size = 23, H=10 and F=13
      [X2 = 3] Node 14: Size = 4, H=0 and F=4, FINAL NODE
      [X2 /= 3] Node 15: Size = 19, H=10 and F=9
        [X9 /= 2] Node 30: Size = 16, H=10 and F=6
          [X3 /= 1] Node 60: Size = 13, H=7 and F=6
            [X3 /= 1] Node 120: Size = 5, H=2 and F=3, FINAL NODE
            [X3 = 1] Node 121: Size = 8, H=5 and F=3
              [X3 = 3] Node 242: Size = 6, H=3 and F=3
                [X9 = 3] Node 484: Size = 4, H=1 and F=3, FINAL NODE
                [X9 /= 3] Node 485: Size = 2, H=2 and F=0, FINAL NODE
              [X3 /= 3] Node 243: Size = 2, H=2 and F=0, FINAL NODE
          [X3 = 1] Node 61: Size = 3, H=3 and F=0, FINAL NODE
        [X9 = 2] Node 31: Size = 3, H=0 and F=3, FINAL NODE

Figure 7. Classification Tree model for three years before the failure

Two years before the failure

Node 1: Size = 120, H=60 and F=60
  [X5 = 3] Node 2: Size = 41, H=11 and F=30
    [X7 /= 1] Node 4: Size = 25, H=11 and F=14
      [X4 /= 1] Node 8: Size = 16, H=4 and F=12
        [X6 = 3] Node 16: Size = 10, H=4 and F=6
          [X1 /= 1] Node 32: Size = 8, H=2 and F=6, FINAL NODE
          [X1 = 1] Node 33: Size = 2, H=2 and F=0, FINAL NODE
        [X6 /= 3] Node 17: Size = 6, H=0 and F=6, FINAL NODE
      [X4 = 1] Node 9: Size = 9, H=7 and F=2
        [X2 = 3] Node 18: Size = 3, H=1 and F=2, FINAL NODE
        [X2 /= 3] Node 19: Size = 6, H=6 and F=0, FINAL NODE
    [X7 = 1] Node 5: Size = 16, H=0 and F=16, FINAL NODE
  [X5 /= 3] Node 3: Size = 79, H=49 and F=30
    [X3 = 3] Node 6: Size = 27, H=23 and F=4, FINAL NODE
    [X3 /= 3] Node 7: Size = 52, H=26 and F=26
      [X9 /= 1] Node 14: Size = 37, H=15 and F=22
        [X4 /= 2] Node 28: Size = 27, H=14 and F=13
          [X5 = 2] Node 56: Size = 14, H=5 and F=9
            [X9 = 3] Node 112: Size = 8, H=5 and F=3
              [X3 = 2] Node 224: Size = 4, H=1 and F=3, FINAL NODE
              [X3 /= 2] Node 225: Size = 4, H=4 and F=0, FINAL NODE
            [X9 /= 3] Node 113: Size = 6, H=0 and F=6, FINAL NODE
          [X5 /= 2] Node 57: Size = 13, H=9 and F=4, FINAL NODE
        [X4 = 2] Node 29: Size = 10, H=1 and F=9, FINAL NODE
      [X9 = 1] Node 15: Size = 15, H=11 and F=4, FINAL NODE

Figure 8. Classification Tree model for two years before the failure

One year before the failure

Node 1: Size = 120, H=60 and F=60
  [X4 = 3] Node 2: Size = 40, H=3 and F=37
    [X7 = 3] Node 4: Size = 8, H=2 and F=6
      [X4 /= 1] Node 8: Size = 7, H=1 and F=6, FINAL NODE
      [X4 = 1] Node 9: Size = 1, H=1 and F=0, FINAL NODE
    [X7 /= 3] Node 5: Size = 32, H=1 and F=31, FINAL NODE
  [X4 /= 3] Node 3: Size = 80, H=57 and F=23
    [X3 /= 1] Node 6: Size = 40, H=22 and F=18
      [X2 /= 2] Node 12: Size = 27, H=19 and F=8
        [X1 /= 1] Node 24: Size = 16, H=8 and F=8
          [X8 /= 1] Node 48: Size = 10, H=8 and F=2
            [X9 = 3] Node 96: Size = 2, H=0 and F=2, FINAL NODE
            [X9 /= 3] Node 97: Size = 8, H=8 and F=0, FINAL NODE
          [X8 = 1] Node 49: Size = 6, H=0 and F=6, FINAL NODE
        [X1 = 1] Node 25: Size = 11, H=11 and F=0, FINAL NODE
      [X2 = 2] Node 13: Size = 13, H=3 and F=10
        [X9 /= 1] Node 26: Size = 8, H=0 and F=8, FINAL NODE
        [X9 = 1] Node 27: Size = 5, H=3 and F=2, FINAL NODE
    [X3 = 1] Node 7: Size = 40, H=35 and F=5, FINAL NODE

Figure 9. Classification Tree model for one year before the failure

Where:
F: represents the class failed (Failed)
H: represents the class non-failed (Healthy)
FINAL NODE: a leaf node where the process ends; the larger of H and F gives the class, failed or non-failed, assigned to an individual that reaches this node.

The failure probability is calculated as $\frac{H}{H+F}$ when the company reaching the leaf node is classified as non-failed, or as $\frac{F}{H+F}$ when it is classified as failed.

Bayesian Networks
Experiments with Bayesian Networks were carried out using the HUGIN software, the Genetic Algorithm and the Markov Blanket approach. The models obtained with a sample of 120 companies as a training set, for each one of the 3 years prior to failure, are shown in Figures 13, 14 and 15. Initially, P(X) = 0.50, which means that the company is as likely to fail as not. After giving real values to the nine variables, its fate is predicted in the following way:
- for three years before the failure: “if P(X) > 0.30 then the company is classified as non-failed, otherwise as failed”
- for two years before the failure: “if P(X) > 0.40 then the company is classified as non-failed, otherwise as failed”
- for one year before the failure: “if P(X) > 0.45 then the company is classified as non-failed, otherwise as failed”

In Figures 13, 14 and 15 only the node labels and their initial probabilities are listed; the arcs of the network diagrams are not reproduced.

Three years before the failure

Variable                                    = 1     = 2     = 3
X  = NON-FAILED / FAILED                    50 %    50 %
X1 = CURRENT ASSETS / CURRENT LIABILITIES   33 %    34 %    33 %
X2 = CURR. ASSETS / TOT. ASSETS             32 %    35 %    33 %
X3 = NET RESULT / TOTAL ASSETS              24 %    43 %    32 %
X4 = EBIT / FINANC. CHARGES                 33 %    32 %    35 %
X5 = OWN FUNDS / TOTAL DEBT                 32 %    33 %    34 %
X6 = SALES / OWN FUNDS                      33 %    34 %    32 %
X7 = STOCKS / SALES                         30 %    38 %    32 %
X8 = DEBTORS / SALES                        33 %    31 %    36 %
X9 = OCF / TOTAL ASSETS                     29 %    32 %    39 %

Figure 13. Bayesian Networks model for three years before the failure

Two years before the failure

Variable                                    = 1     = 2     = 3
X  = NON-FAILED / FAILED                    50 %    50 %
X1 = CURR. ASSETS / CURR. LIABIL.           33 %    33 %    33 %
X2 = CURRENT ASSETS / TOTAL ASSETS          33 %    33 %    33 %
X3 = NET RESULT / TOTAL ASSETS              30 %    34 %    36 %
X4 = EBIT / FINANC. CHARGES                 33 %    33 %    33 %
X5 = OWN FUNDS / TOTAL DEBT                 33 %    33 %    33 %
X6 = SALES / OWN FUNDS                      33 %    33 %    34 %
X7 = STOCKS / SALES                         30 %    35 %    35 %
X8 = DEBTORS / SALES                        34 %    32 %    34 %
X9 = OCF / TOTAL ASSETS                     34 %    31 %    35 %

Figure 14. Bayesian Networks model for two years before the failure

One year before the failure

Variable                                    = 1     = 2     = 3
X  = NON-FAILED / FAILED                    50 %    50 %
X1 = CURR. ASSETS / CURR. LIABILITIES       35 %    31 %    34 %
X2 = CURR. ASSETS / TOT. ASSETS             33 %    33 %    34 %
X3 = NET RES. / TOT. ASS.                   33 %    38 %    29 %
X4 = EBIT / FIN. CHARGES                    32 %    34 %    33 %
X5 = OWN FUNDS / TOTAL DEBT                 33 %    33 %    33 %
X6 = SALES / OWN FUNDS                      33 %    33 %    33 %
X7 = STOCKS / SALES                         31 %    32 %    37 %
X8 = DEBTORS / SALES                        34 %    32 %    34 %
X9 = OCF / TOTAL ASSETS                     35 %    28 %    38 %

Figure 15. Bayesian Networks model for one year before the failure

The results obtained with the aforementioned models are summarized in the next two tables as the percentage of well-classified companies for each paradigm and period of time (1 year, 2 years, and 3 years before the failure). Table 1 summarizes the goodness of fit of each one of the paradigms calculated by the 5-fold cross-validation method explained in Section 2.

Years before                                  BAYESIAN
the failure     DA      LR      CART    CN2   NETWORKS
1               78.33   82.50   79.17   80.0  60.83
2               69.17   69.16   60.00   66.66 62.00
3               55.00   55.00   45.00   57.50 60.83

Table 1. Results regarding the 5-fold cross-validation

As shown in Table 1, the further back in time one goes, the worse the results are, although for Bayesian Networks they remain more or less constant over time. These results can be compared with the ones shown in Table 2, which gives the goodness of fit of each paradigm calculated with the second validation method proposed in Section 2.

Years before                                  BAYESIAN
the failure     DA      LR      CART    CN2   NETWORKS
1               81.00   82.5    89.00   95.17 91.17
2               72.17   71.51   84.50   91.17 92.00
3               59.84   60.33   76.67   88.00 92.17

Table 2. Results regarding the second validation

In this case it is worth mentioning the good behavior of CN2 and Bayesian Networks, as well as of Classification Trees.

The goal of any algorithm for the integration of classification results is to generate more certain, precise and accurate results. Two experiments were conducted in order to compare the performance of the aforementioned methods and combining techniques. The following two tables summarize the results obtained by applying the combining methods, Voting By Majority Principle and Bayesian Formalism, to the classification results when 5-fold cross-validation is used (explained in Section 2).

Years before                                  BAYESIAN   VOTING BY
the failure     DA      LR      CART    CN2   NETWORKS   MAJORITY
1               78.33   82.50   79.17   80.0  60.83      88.33
2               69.17   69.16   60.00   66.66 62.00      79.17
3               55.00   55.00   45.00   57.50 60.83      73.33

Table 3. Results regarding the 5-fold cross-validation and Voting By Majority Principle

As expected, the Voting By Majority Principle has succeeded in improving the classification results, being better than the best individual classifier for each of the 3 years. These good results can be compared with the ones obtained by applying the Bayesian Formalism combining method, shown in Table 4.

Years before                                  BAYESIAN   BAYESIAN
the failure     DA      LR      CART    CN2   NETWORKS   FORMALISM
1               78.33   82.50   79.17   80.0  60.83      73.33
2               69.17   69.16   60.00   66.66 62.00      75.83
3               55.00   55.00   45.00   57.50 60.83      62.50

Table 4. Results regarding the 5-fold cross-validation and Bayesian Formalism
As can be noticed, the Bayesian Formalism gives better results than the best individual classifier only for 3 and 2 years before the failure; in the case of 1 year it is, surprisingly, better only than the worst classifier.

It is interesting to mention that when the second validation method is used to produce the classification results of the individual classifiers, these two combining methods reach 100% correct classification for each one of the 3 years. This seems too good to be true, but it could be understood as follows: the more classifiers there are (in this case there are 25 models, 5 for each paradigm), the greater the possibility that the majority of them vote for the real class, and that the Bayesian Formalism reaches 100% too.

In the following Section a comparison of results is presented, trying to reach some conclusions about the performance of the individual classifiers and their integration.

5. Conclusions and Future Work

Techniques coming from Statistics and Artificial Intelligence have been applied to the problem of Corporate Failure prediction for different periods of time.

The discrepancy between the results obtained when the models are trained with 96 of the 120 companies and tested on all 120 of them, and the ones obtained with 5-fold cross-validation, suggests the existence of a problem of overfitting in the Bayesian Networks (Markov Blankets) and CN2 paradigms. It is possible that this problem could be relieved by penalizing complex structures in the learning process of these paradigms (see Akaike's criterion, Minimum Description Length, …) and by trying to guide the search by the parsimony principle.

It is interesting to see that the Discriminant Analysis and Logistic Regression models select the same ratio-variables as significant for all of the three years, and that the percentages of well-classified individuals are rather similar, with the Logistic Regression classifier being a little better.
    Observing the Classification Trees, it is easy to notice that the closer the date of failure is, the smaller the depth of the tree. This can be understood as follows: the closer the date of failure, the fewer variables need to be examined, and the sooner conclusions about the “destiny” of the individuals in the testing set are reached. This phenomenon is also present in the ordered if-then rule lists: the closer the date of failure, the fewer rules need to be checked to determine the “destiny” of the company. For three years before the failure there are thirteen rules in the list, ten for two years and just seven for one year.

    The Bayesian Networks paradigm provides another aspect of interest. Observing the three networks, it can be seen that the closer the date of failure is, the more connected the nodes are (there are more dependencies between the variables) and the greater the thresholds are.
  • 34. Analyzing the classification results obtained with the combining techniques, it can be seen that they generally achieve their main goal, which is to improve the performance of the individual models. The Voting By Majority Principle is considerably better than the Bayesian Formalism, which for one year before the failure only improves on the performance of the worst individual classifier.

    Another aspect which deserves further investigation is the development or implementation of procedures that would achieve even better results than the two used in this project. There are several approaches. One of them is to also use other kinds of information output by the individual classifiers, such as the probability that the class assigned to individual X is the right one. Another is to guide the performance of the integrated classifier by the goodness of each of the individual classifiers.
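The two directions for future work named above (using the probabilities output by the classifiers, and weighting each classifier by its goodness) could take many forms. The sketch below shows one simple possibility for each, as an assumption made for illustration rather than a procedure proposed or evaluated in this project. The function names, class labels, weights and probabilities are all hypothetical; in practice the weights might be cross-validated accuracies.

```python
def weighted_vote(labels, weights):
    """Sum the weights of the classifiers voting for each class.

    Returns the class with the largest total weight.
    """
    scores = {}
    for label, weight in zip(labels, weights):
        scores[label] = scores.get(label, 0.0) + weight
    return max(scores, key=scores.get)


def mean_failure_probability(probabilities, weights):
    """Weighted average of the per-classifier probabilities of failure."""
    total = sum(weights)
    return sum(p * w for p, w in zip(probabilities, weights)) / total


if __name__ == "__main__":
    # Hypothetical case: one reliable classifier disagrees with two weak ones.
    weights = [0.9, 0.3, 0.3]
    print(weighted_vote(["failed", "healthy", "healthy"], weights))

    # Hypothetical probabilities of failure output by the same classifiers.
    print(mean_failure_probability([0.9, 0.4, 0.4], weights))
```

In this hypothetical case an unweighted majority vote would choose "healthy", while the weighted vote follows the single reliable classifier and chooses "failed". Whether such a scheme improves on the results in Tables 3 and 4 would have to be established experimentally.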
  • 35. ACKNOWLEDGEMENTS

    The authors wish to thank Dr. Fermin Lizarraga for providing the original data sample and his doctoral dissertation, which was the guidance for comprehension of the corporate failure problem, and of the use of financial ratios for failure prediction. The authors acknowledge Miren Josune Gallego for helpful guidance in the use of the software packages, especially the SAS software. The authors would also like to thank the Diputación Foral de Guipuzcoa since this study was supported under grant no. OF 131/1997.