SOLVENCY ASSESSMENT IN AN UNBALANCED SAMPLE

Javier De Andrés, Pedro Lorca

University of Oviedo, Faculty of Economy and Business, Department of Accounting

Avda. del Cristo s/n, Oviedo 33006, Spain

Fernando Sánchez-Lasheras

University of Oviedo, Department of Construction and Manufacturing Engineering

Campus de Gijón, Edificio 5, 33204 Gijón, Spain

Francisco Javier de Cos-Juez

University of Oviedo, Department of Exploitation and Exploration of Mines, C/ Independencia, 13, Oviedo 33004, Spain

Keywords: Solvency, Unbalanced sample, Self organized maps (SOM), Multivariate adaptive regression splines

(MARS).

Abstract: This paper proposes an improved approach to the assessment of firms’ solvency. First, sound companies are

classified into clusters according to their financial similarities by using Kohonen’s Self Organizing Maps

(SOM). Then, each cluster is replaced by a director vector which summarizes all the companies that the

cluster includes. The next step is the estimation of a classification model through Multivariate Adaptive

Regression Splines (MARS). For the test of the model we considered a real setting of Spanish enterprises

from the construction sector. In this dataset the proportion of distressed firms is very close to that which is

derived from Economic statistics. Our results indicate that our system performs better than two

benchmarking models, namely a back-propagation neural network and a simple MARS model.

1 INTRODUCTION

During the last years the importance of bankruptcy

forecasting models is very high due to the current

financial crisis, which demands an even more

careful management of financial resources.

Furthermore, under Basel II Accord

recommendations (Bank for International

Settlements, 2006), banks which choose to develop

their own empirical model to quantify required

capital for credit risk (Internal Rating-Based

Approach) are required to maintain less capital than

those using the Standardized Approach.

The model applied in the present research is

considered as an hybrid system (HS). HS combine

two or more intelligent techniques in several forms

to derive the advantages of all of them. Most HS

require a considerable amount of data to reach to

accurate estimations. This is not a problem

nowadays, as there exists publicly available

databases containing financial information of listed

and unlisted firms.

However, studies using HS for bankruptcy

prediction suffer from a drawback which is that the

majority of them estimate the model upon the basis

of a sample in which non-failed companies are

underrepresented. In most cases a matched-pairs

design is used. The selection of non-failed firms is

arbitrary, which makes the model to achieve a high

in-sample percentage of correct classifications but it

is likely to be inaccurate for failure prediction in

new cases drawn from a real population.

Another strategy is to consider a “real”

population as the sample. That is, to consider all the

companies for which we have financial information

available. However, as only a very small percentage

of firms enter into financial distress in a normal

economic situation, such samples are very

unbalanced. This causes coefficient instability and

leads to poor performance ability of the models

(Foglia et al., 2001).

283

De Andrés J., Lorca P., Sánchez-Lasheras F. and De Cos-Juez F..

SOLVENCY ASSESSMENT IN AN UNBALANCED SAMPLE.

DOI: 10.5220/0003620202830286

In Proceedings of the International Conference on Knowledge Management and Information Sharing (KMIS-2011), pages 283-286

ISBN: 978-989-8425-81-2

 2011 SCITEPRESS (Science and Technology Publications, Lda.)

As an alternative to both strategies we propose a

HS model where, upon the basis of a real population

of firms, data are preprocessed to summarize the

information of healthy firms. So, the initial

unbalanced sample is transformed into a balanced

one which retains the main features of the healthy

firms. Self Organized Maps (SOM) is used in this

stage. Then a classification device is developed upon

the transformed sample, for which we use the

Multivariate Adaptive Regression Splines (MARS)

approach. The results are compared with

benchmarks which are popular in bankruptcy

prediction literature. As an important application of

the combined approach, this paper applies it to the

solvency assessment of Spanish construction firms.

2 THE DATABASE AND THE

FINANCIAL RATIOS FOR

PREDICTING BANKRUPTCY

In the present research a database with Spanish

construction firms was drawn up. As bankrupt

companies we considered those whose judicial

declaration took place in 2008. In accordance with

Spanish legislation, limited liability companies are

required to deposit their annual accounts in the

Registro Mercantil. We deleted from the sample

companies that did not provide full information

about all the variables from the year prior to

bankruptcy. To avoid the distortions small

enterprises whose annual accounts are were also

deleted from the database (firms whose total assets

were below 100K €) We obtained a final data set

that was made up of 63.107 firms. Of these, a total

of 256 companies went bankrupt in 2008.

In this paper we used the five variables proposed

by E.I. Altman in his seminal paper on the

usefulness of linear discriminant analysis (Altman,

1968). Therefore, the five variables used in this

paper are the following: working capital/total assets

(X1), retained earnings/total assets (X2), earnings

before interest and taxes (EBIT)/total assets

(Altman, 1993) (X3), book value of equity/book

value of total debt (X4), and sales/total assets (X5).

3 ALGORITHM AND

ANALYTICAL PROCEDURE

3.1 The proposed Hybrid Model

The model proposed in the present research combi-

combines the use of MARS models with a clustering

technique which is SOM mapping in order to obtain

a MARS model which uses as training information

only those companies considered as representative

of each cluster. A more detailed explanation of the

steps of the algorithm is presented below.

Step 1: Study of the similarities of the bankrupt

companies by means of Mahalanobis distances. The

Mahalanobis distance is a non-euclidean distance

measure (Mahalanobis, 1936) based on correlations

between variables.

Step 2: Those bankrupted companies that were more

dissimilar to the rest of the sample were signalled as

outliers and removed from the data set to be

employed for step 3 although they were taken into

account for the training and validation of the model.

The determination of the bankrupted companies

considered as outliers was done by means of the

robust estimation of the parameters in the

Mahalanobis distance (Rousseeuw and Van

Zomeren, 1990) and the comparison with a critical

value of the Chi-square distribution (in our case the

95% quantile).

Step 3: The Mahalanobis distance of each one of the

non-bankrupt companies versus the set of all the

bankrupted companies not considered as outliers

was calculated.

Step 4: A new category of companies was created,

which was called “borderline”. The companies that

were not considered as outliers when compared with

the sample of bankrupt companies are supposed to

be more likely to go bankrupt than the rest of non-

bankrupted companies. Therefore they were

included in this new category.

Step 5: Companies belonging to non-bankrupted and

borderline populations were classified in clusters

using the SOM algorithm (Kohonen, 1995). Several

clusters of different dimensions were defined and

trained with the non-bankrupted and borderline sets.

This step is performed in order to obtain a more

balanced set of data for the training of the models in

the next steps.

Step 6: An algorithm based on MARS (Friedman,

1991) was then used to implement a different model

for each set of clusters. These models are estimated

in order to determine the number of clusters that best

represents the initial set of data. Every MARS model

was then applied to the initial set of bankrupt and

non-bankrupt companies and the performance of

each one was evaluated by means of their specificity

and sensibility (more details on this point are

provided below).

KMIS 2011 - International Conference on Knowledge Management and Information Sharing

284

Step 7: The last step of the algorithm consisted in

the training and validation of a MARS model using

the number of clusters with the best performance in

step 6.

4 RESULTS

In this section we detail the results of the algorithm,

as well as those of the benchmark techniques.

4.1 The Algorithm

First, table 3 details the number of clusters for each

model. All companies belonging to non-bankrupt

and borderline populations were classified in

clusters using SOM. The clusters were obtained as

the output of step 5 of the algorithm. As can be

observed, the minimum number of clusters used for

the models is 144. This means that the original SOM

was of

)1212( 

neurons. Please note that each

cluster is represented by a director vector. A director

vector (Perner, 2008) can be described as the

expected value for each one of the independent

variables for all the companies that belong to a

certain cluster. As it was already mentioned before,

this step was performed in order to obtain a more

balanced set of data for the training of the models in

the following steps in which each cluster was

represented by a director vector that aims to

summarize the information of all the individuals

contained in each subset. Table 1 shows the number

of clusters that were used and in which model they

were employed. Please note that all the models were

trained using the 204 bankrupted companies.

Table 1: Number of clusters used for each model.

Model name

Number of director vectors (clusters)

Non-bankrupt companies Borderline companies

M1 144 144

M2 169 169

M3 196 196

M4 225 225

M5 256 256

M6 289 289

M7 324 324

M8 361 361

M9 400 400

An algorithm based on MARS models (step 6)

was afterwards used in order to implement a

different model for each set of clusters. The

intention of these models is to determine the number

of clusters that best represents the initial set of data.

In order to reach this aim, all the models were

trained using a set which comprises (a) all the

bankrupted companies, (b) the director vectors

corresponding to non-bankrupt non-borderline

companies and (c) the director vectors

corresponding to borderline companies. The

validation was made by calculating the confusion

matrix using the information of the original

database. Table 2 shows the average percentage of

correctly classified companies in the five runs of

each model. The last column of the mentioned table

represents the total percentage of companies of the

database that were correctly classified by the model.

This is the most important parameter as it gives us

an outlook of the global performance of the model.

It is noticeable that although no maximum degree

value was imposed to the MARS models all the

models from M1 to M9 were of degree 3.

Table 2: Average percentage of companies that are

correctly classified in their corresponding category.

Model

name

% of companies correctly classified

Bankr.

Non-

Bankr.

Bord.

Non-

bankr.

+Bord.

Total

88.10

57.50 89.70 82.51 79.16

88.30

58.30 90.70 83.46 79.94

88.90

59.80 91.20 84.19 81.18

88.90

60.30 91.30 84.38 81.96

88.70

60.40 91.60 84.63 84.29

88.50

60.60 92.30 85.22 85.22

87.90

62.80 91.50 85.09 85.09

87.30

61.30 87.20 81.42 81.43

85.40

58.80 83.20 77.75 77.78

According to the results of Table 2, the model

with the highest performance was M6 although their

results were very close to M7 and that was the

reason why in the last step of the algorithm two

MARS models were validated and trained using as

input information the numbers of clusters defined by

both M6 and M7. Finally, step 7 consisted in the

training and validation of M6 and M7. We used as

input information the whole database and performed

five runs in which 80% of the information was used

for training and the other 20% for validation.

Table 3 contains a confusion matrix in which the

mean values obtained in the validation of the results

of the five different M6 MARS models are shown.

Please note that the results of model M7 are not

presented as they were slightly worst than those

obtained for M6.

SOLVENCY ASSESSMENT IN AN UNBALANCED SAMPLE

285

Table 3: Confusion matrix: average values of the

validation results of 5 different M6 MARS models trained.



Real category

Non-bankr. Bankr.

Predicted