HYBRID OPTIMIZATION TECHNIQUE FOR ARTIFICIAL NEURAL
NETWORKS DESIGN
Cleber Zanchettin and Teresa B. Ludermir
Center for Informatics, Federal University of Pernambuco
Recife, PE - P.O.Box 7851, 50.732-970, Brazil
Keywords:
Global optimization, Artificial neural networks, Relevant feature selection, Experimental design.
Abstract:
In this paper, a global and local optimization method is presented. The method is based on the integration of the heuristics Simulated Annealing, Tabu Search and Genetic Algorithms with Backpropagation. The performance of the method is investigated in the optimization of Multi-layer Perceptron artificial neural network architectures and weights. The heuristics perform the search in a constructive way, based on the pruning of irrelevant connections among the network nodes. Experiments demonstrate that the method can also be used for relevant feature selection. Experiments are performed with four classification datasets and one prediction dataset.
1 INTRODUCTION
This paper presents a global and local optimization method based on the integration of the heuristics Simulated Annealing (SA), Tabu Search (TS) and Genetic Algorithms (GA) with Backpropagation. The performance of the method is investigated in the simultaneous optimization of MLP architecture and weights. The aim is to achieve topologies with few connections and high performance on any dataset.
The proposed method presents some interesting characteristics: (1) search optimization for generating new solutions; (2) pruning to eliminate connections and optimize network size; and (3) a constructive approach for finding the best network topology.
We also investigate different cost functions for neural network training. Five cost functions are investigated: (1) the average method; (2) the weighted average method; (3) a weight-decay mechanism; (4) a multi-objective optimization strategy; and (5) a combination of the weight-decay and multi-objective strategies.
Experiments demonstrated that the method can
also be used for relevant feature selection. During
the network topology optimization, input units may
be eliminated in accordance with the method perfor-
mance. Thus, a feature selection is obtained that is
inherent to the ANN optimization process.
The main configuration parameters of the method also have their influence estimated by factorial experimental design, through which it is possible to determine the influence, interaction and inter-relationship among these parameters. These analyses are important for identifying the influence of each parameter and may reduce the number of adjustments required when deploying the method.
2 THE PROPOSED METHOD
The proposed method (GaTSa) is based on the integration of the heuristics SA, TS, GA and Backpropagation. The pseudo-code of the proposed method and further implementation details are presented in (Zanchettin and Ludermir, 2006). The next subsections present the implementation details most important for this paper.
2.1 Cost Function
Unlike constructive algorithms, which generate a solution only at the end of the process, iterative algorithms derive a possible solution in each iteration. The cost function is used to evaluate the performance of successive iterations and to select a solution that minimizes (or maximizes) an objective function.
Different cost functions can be used to evaluate
the quality of a solution. In this work five ways of
performing cost evaluation were investigated:
Average Method. The technique uses the arith-
metic mean between the objectives to be optimized
in the problem. For classification problems, the cost
f (s) of the solution s is:
$$f(s) = \frac{1}{2}\bigl(E(P_t) + \psi(C)\bigr) \qquad (1)$$

where $E(P_t)$ is the classification error on the training set $P_t$ and $\psi(C)$ is the percentage of connections used by the network.
For prediction problems, the cost f (s) of the
solution s is given by the mean of the squared error
percentage (SEP) for the training set and the percent-
age of connections (ψ(C)) used by the network:
$$f(s) = \frac{1}{2}\bigl(SEP(P_t) + \psi(C)\bigr) \qquad (2)$$

$$SEP = 100\,\frac{o_{max} - o_{min}}{N_c \cdot \#P_t} \sum_{p=1}^{\#P_t} \sum_{i=1}^{N_c} \bigl(\varphi(x)_{pi} - \gamma(x)_{pi}\bigr)^2 \qquad (3)$$
where $o_{min}$ and $o_{max}$ are the minimum and maximum values of the output coefficients in the problem representation (assuming these are the same for all output nodes), $N_c$ is the number of output nodes, $\#P_t$ is the number of training patterns, and $\varphi(x)_{pi}$ and $\gamma(x)_{pi}$ are the actual and desired outputs of node $i$ for pattern $p$.
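For concreteness, a minimal Python sketch of Eq. (3), assuming `outputs` and `targets` are arrays holding $\varphi(x)_{pi}$ and $\gamma(x)_{pi}$ (the function and variable names are ours, not from the paper):

```python
import numpy as np

def squared_error_percentage(outputs, targets, o_min, o_max):
    """SEP, Eq. (3): outputs and targets are (#P_t, N_c) arrays of
    actual and desired network outputs; o_min/o_max bound the output
    coefficients of the problem representation."""
    n_patterns, n_classes = targets.shape
    return (100.0 * (o_max - o_min) / (n_classes * n_patterns)
            * np.sum((outputs - targets) ** 2))
```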
Weighted Average Method. In the experiments, the network performance, the percentage of network connections and the percentage of hidden nodes are weighted by the parameters α, β and κ, respectively. For classification problems:
$$f(s) = \frac{E(P_t)\,\alpha + \psi(C)\,\beta + pN\,\kappa}{\alpha + \beta + \kappa} \qquad (4)$$

For prediction problems:

$$f(s) = \frac{SEP(P_t)\,\alpha + \psi(C)\,\beta + pN\,\kappa}{\alpha + \beta + \kappa} \qquad (5)$$
where pN is the percentage of used hidden node
connections, α = 1, β = 0.5 and κ = 0.25.
Weight Decay. Weight decay was initially proposed as an improvement to the backpropagation (BP) algorithm, implementing a preference bias toward robust neural networks that are insensitive to noise.
The weight-decay mechanism is used in the GaTSa cost function to eliminate small weights that the artificial neural network could otherwise use to encode noise patterns, and is implemented as:
$$f(s) = \frac{1}{2}E(P_t) + \frac{1}{2}\psi(C) + \frac{1}{2}\,\mu \sum_{i,j} \frac{W_{ij}^2}{1 + W_{ij}^2} \qquad (6)$$
where µ is a parameter controlling the importance of the weight-decay (bias) term and $W_{ij}$ are the network weights.
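A sketch of Eq. (6) under the same caveat (names and the default value of µ are illustrative; the paper does not fix µ here):

```python
import numpy as np

def weight_decay_cost(error, conn_pct, weights, mu=0.1):
    """Eq. (6): error is E(P_t) (or SEP for prediction), conn_pct is
    psi(C), weights is a flat array of all W_ij; mu weighs the decay
    term. The mu default is an assumption, not a reported value."""
    decay = np.sum(weights ** 2 / (1.0 + weights ** 2))
    return 0.5 * error + 0.5 * conn_pct + 0.5 * mu * decay
```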
Multi-objective Optimization. Most problems have no unique, global solution, but rather a set of equally efficient, non-inferior alternative solutions known as the Pareto-optimal set. Pareto-optimal solutions are those for which no objective can be improved without degradation in another.
In the present work, the multi-objective strategy is used in the genetic operators to evolve the population of solutions, considering two goals to be minimized: the MLP size and the generalization error.
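The Pareto-dominance test behind this strategy can be sketched as follows (illustrative code; the encoding of solutions as objective tuples is our assumption):

```python
def dominates(a, b):
    """True if `a` Pareto-dominates `b`; each solution is a tuple of
    objectives to be minimized, e.g. (network size, generalization error)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(solutions):
    """Return the non-dominated (Pareto-optimal) subset of `solutions`."""
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o is not s)]
```

For example, `pareto_front([(90, 0.12), (40, 0.15), (40, 0.20)])` keeps the first two solutions and discards the third, which `(40, 0.15)` dominates.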
2.2 Feature Subset Selection
In the training and improving network weights and
connections, the method is able to eliminate input
connections of the neural network architecture. The
input processing node represents a feature of the
dataset. The input connections with the highest usage
frequency have the highest importance in the classi-
fication or prediction task and possibly inputs with
minor statistical relevance will be the first to be re-
moved. If the method eliminate all input connections
of one input processing node, this input is eliminated
of the network architecture.
In this way, the proposed method can be used for feature subset selection, reducing the problem dimensionality and consequently the complexity of the generated artificial neural network. We use the classical feature selection methods Hill-Climbing, Random Bit Climber, Best-First and Las Vegas, in a wrapper fashion, to evaluate the performance of the proposed method in optimal feature subset selection.
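The inherent selection rule can be sketched as a check on the optimizer's connectivity mask (a hypothetical representation; the method's internal encoding may differ):

```python
import numpy as np

def surviving_features(conn_mask):
    """conn_mask: boolean (n_inputs x n_hidden) matrix marking which
    input-to-hidden connections the optimizer kept. A feature survives
    only while at least one of its outgoing connections survives."""
    return np.flatnonzero(conn_mask.any(axis=1))  # indices of kept features
```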
3 METHODOLOGY
Summary of the Used Data Sets. In experiments,
four classification simulations are performed: (1) the odor recognition problem in the artificial noses data set (de Souza et al., 1999); (2) diabetes diagnosis in the Pima Indians data set (Blake and Merz, 1998); (3)
Fisher’s Iris data set (Anderson, 1953); (4) Thy-
roid dysfunction data set (Quinlan, 1987); and one
prediction simulation: (1) Mackey-Glass time series
(Mackey and Glass, 1977).
Table 1 presents a summary of the databases used. The table shows: the number of examples; the number of binary (b), continuous (c) and total (t) features; the number of binary (b) and continuous (c) classes; and the class entropy in bits per example.
Table 1: Characteristics of the used datasets.

Database         Examples   Features (b/c/t)   Classes (b/c)   Entropy
Artificial Nose  5,400      0 / 6 / 6          3 / 0           1.5849
Iris             150        0 / 4 / 4          3 / 0           1.5849
Diabetes         768        0 / 8 / 8          2 / 0           0.9332
Thyroid          7,200      15 / 6 / 21        3 / 0           0.4468
Mackey-Glass     1,000      0 / 4 / 4          0 / 1           2.000
Training Methodology. The local training al-
gorithm used is the backpropagation method. The
learning rate was set at 0.001 and the momentum term
at 0.7. Each simulation contains 30 runs performed
with different random weight initializations. Training
was stopped if: (i) the GL_5 criterion was satisfied twice (to avoid initial oscillations in validation errors); (ii) the training progress criterion was met, with P_5(t) < 0.1; or (iii) a maximum number of iterations was reached.
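A sketch of these criteria, assuming the standard GL and P_k definitions from Prechelt's Proben1 rules (the paper does not restate them):

```python
def generalization_loss(val_errors):
    """GL(t) = 100 * (E_va(t) / E_opt(t) - 1), with E_opt the lowest
    validation error observed so far; GL_5 is satisfied when GL(t) > 5."""
    return 100.0 * (val_errors[-1] / min(val_errors) - 1.0)

def training_progress(train_errors, k=5):
    """P_k(t) over the last training strip of k epochs; training is
    considered stalled when P_5(t) < 0.1."""
    strip = train_errors[-k:]
    return 1000.0 * (sum(strip) / (k * min(strip)) - 1.0)

# Stopping sketch: stop once GL(t) > 5 has occurred twice, once
# P_5(t) < 0.1, or once the iteration budget is exhausted.
```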
Subset Selection - Classical Methods. In ex-
periments, we use different search strategies (For-
ward, Backward and Random), considering the char-
acteristics of the search technique. The Forward strat-
egy starts with the empty set and adds features. The
Backward strategy starts with the full set and deletes
features. The Random approach starts from a ran-
dom set and randomly performs the addition and re-
motion of features. Unlike the others, the Random
Bit Climber method has attributes removed and added
during the search process. Thus, in order to carry out
the the search in different directions, we used different
initial states - initial solution without features, with all
features and randomly selected features. The Las Ve-
gas and the proposed GaTSa methods implement their
own search strategies.
To determine the classification accuracy for the classical methods (Hill-Climbing, Best-First, Random Bit Climber and Las Vegas), a k-Nearest Neighbor (k-NN) classification algorithm is used. In the k-NN algorithm, the value of k is 7, defined empirically.
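The wrapper evaluation can be sketched with scikit-learn (our illustration; the paper does not name an implementation):

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def subset_accuracy(X, y, feature_idx):
    """Score a candidate feature subset with k-NN (k = 7, as in the
    paper) under 10-fold cross-validation, the same protocol used to
    evaluate the classifier with all features (see Section 4)."""
    knn = KNeighborsClassifier(n_neighbors=7)
    return cross_val_score(knn, X[:, feature_idx], y, cv=10).mean()
```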
4 RESULTS AND DISCUSSION
For SA, TS and GA, the maximal topology in the Artificial Nose data set (A) contains six input units, ten hidden units and three output units (N1 = 6, N2 = 10 and N3 = 3; the maximum number of connections, N_max, is 90). In the Iris data set (B), N1 = 4, N2 = 5, N3 = 3 and N_max = 32. For the Thyroid data set (C), N1 = 21, N2 = 10, N3 = 3 and N_max = 240. In the Diabetes data set (D), N1 = 8, N2 = 10, N3 = 2 and N_max = 100. In the Mackey-Glass (E) experiments, N1 = 4, N2 = 4, N3 = 1 and N_max = 50. In all neural network topologies, the N1 and N3 values are problem-dependent, and N2 was obtained from the experiments in (Zanchettin and Ludermir, 2006). For GaTSa, the same N1 and N3 values are used, but N2 is optimized constructively, together with the network weights and connections.
Figure 1 displays the average performance of each
optimization technique investigated. These results
were obtained for each technique in the optimization
of the number of connections and weight connection
values of an MLP artificial neural network. The pa-
rameters evaluated were: (1) Squared Error Percent-
age (SEP) and the classification error (Class) of the
test set; and (2) Percentage of network connections.
This figure displays the average results of 10 simula-
tions. Each simulation contains 30 different runs of
the algorithms.
Genetic algorithm, tabu search and simulated annealing methods incorporate domain-specific knowledge in their search heuristics. They also tolerate some elements of non-determinism, which helps the search escape from local minima. The proposed integration combines these advantages in order to use a
larger amount of information in the problem domain
and apply this information to practically all search
phases. The initial solution is coded with a minimum
valid network topology and hidden nodes are inserted
in the network topology during algorithm execution.
This process is similar to constructive neural network
training and allows better topology selection. Moreover, the proposed methodology has two well-defined stages: a global search phase, which makes use of the capacity of genetic algorithms for generating new solutions, the cooling process and cost function of simulated annealing, and the memory characteristics of the tabu search technique; and a local search phase, which uses gradient descent for more precise solution adjustment. A structural sketch of these two phases is given below.
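This is our simplification, not the authors' pseudo-code (see Zanchettin and Ludermir, 2006); `cost`, `perturb`, `grow` and `initial` are user-supplied stand-ins for the real operators, and the default parameters are illustrative:

```python
import math
import random

def gatsa_sketch(cost, perturb, grow, initial, max_it=200,
                 t0=1.0, alpha=0.95, tabu_len=10):
    """Two-phase sketch: a global phase mixing GA-style generation
    (perturb), SA acceptance/cooling and a bounded TS memory, with
    periodic constructive growth (grow). Solutions must be comparable
    with == for the tabu test."""
    best = current = initial          # minimum valid topology
    temperature, tabu = t0, []
    for it in range(max_it):          # global search phase
        candidate = perturb(current)  # GA-style new solution
        if candidate in tabu:         # TS memory: reject tabu solutions
            continue
        delta = cost(candidate) - cost(current)
        if delta < 0 or random.random() < math.exp(-delta / temperature):
            current = candidate       # SA acceptance rule
        tabu = (tabu + [candidate])[-tabu_len:]   # bounded tabu list
        if cost(current) < cost(best):
            best = current
        if it % 50 == 49:             # constructive step: insert a
            current = grow(current)   # hidden node periodically
        temperature *= alpha          # SA cooling schedule
    return best  # the local phase would now fine-tune with backpropagation
```

As a toy check, `gatsa_sketch(cost=lambda s: s * s, perturb=lambda s: s + random.uniform(-1, 1), grow=lambda s: s, initial=10.0)` drives a scalar solution toward zero.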
For all data sets, the optimized neural networks obtain a lower classification error than those obtained by MLP networks without topology optimization (Zanchettin and Ludermir, 2006), and the mean number of connections is much lower than the maximum number allowed. In the greater number of simulations, the best MLP architecture optimization performance was obtained by the GaTSa method.
It is important to note that in the experiments with GaTSa, the average number of connections was computed in relation to the maximum network topology generated, rather than to the maximum fixed topology (as in the other models). This seemed the fairest approach; however, in some ways it harmed the model, because most of the time the proposed method generated topologies with fewer connections than the maximum allowed.
Statistically, the GaTSa method achieves better
optimization of the architecture input nodes. The
MLP performance obtained from the optimized
neural networks was statistically equivalent for the
Thyroid, Diabetes and Mackey Glass data sets. The
GaTSa method obtained better results in the Artificial
Nose data set, whereas GA had the best performance
in the Iris data set.
GaTSa - The Cost Function Influence. Table 2 displays the experiment results; these values are the average performance over 10 simulations. Each
Figure 1: Optimization technique performance.
simulation contains 30 different runs of the algo-
rithms. In this table, the evaluated cost functions are:
the Average method (Average), Weighted Average
(WA), Weight-Decay (WD), Multi-Objective (MO)
and the combination of Multi-Objective and Weight-
Decay (MO+WD). The parameters evaluated were:
(1) Squared Error Percentage (SEP) and the classifi-
cation error (Class) of the test set; and (2) Percentage
of network connections.
Table 2: Experiment Results.

               Average    WA       WD       MO       MO+WD
A  Class. (%)   11.86     11.08     7.55     7.04    12.62
   Connec.      50.24     53.38    33.24    49.65    42.05
B  Class. (%)    6.12      5.46     6.93     4.27     3.98
   Connec.      13.38     16.13    18.11    15.86     9.80
C  Class. (%)    7.10      7.17     6.81     6.92     6.86
   Connec.      83.88     99.32    91.67   114.66   115.09
D  Class. (%)   28.45     28.43    25.75    28.26    25.85
   Connec.      31.84     38.18    31.74    43.27    42.03
E  SEP Test      0.62      0.80     0.27     0.57     0.62
   Connec.       2.15      2.54     2.08     3.63     3.32
In the Artificial Nose data set, the best classifi-
cation results were obtained by the multi-objective
approach and the best architecture optimization was
found by the weight-decay method. In the Iris data
set, the combination of weight-decay and genetic op-
erators using multi-objective optimization presented
the best performance. The weight-decay cost func-
tion presented the best optimization performance in
the Thyroid, Diabetes and Mackey-Glass data sets.
In some analyses, the Artificial Nose, Diabetes
and Thyroid data sets presented absolute determin-
istic and absolute random noise. The sources of ab-
solute deterministic noise are computational errors
and systematic measurement errors. Absolute ran-
dom noise is typical in optimization problems such
as adaptation, learning and pattern recognition. This
noise probably influenced the experiments, but the
average performance of the cost functions was con-
firmed.
The better performance of weight-decay demon-
strates the capacity of this method for restricting the
type of functionality that the network can produce by
favoring networks that produce smoother functions.
Smooth output functions are generally more likely to
represent the underlying functions of real-world data.
Moreover, the use of weight-decay can modify the er-
ror surface of a given problem in such a way as to
reduce the growth of large update values.
The use of multi-objective optimization in the genetic operators presented interesting results in some data sets, but exhibited poor performance in most. The difficulty lies in choosing the best solution when several solutions share the same position in the Pareto ranking. In this case, the solution with the smallest classification error was chosen. However, there is no guarantee that this solution has a small number of connections. A better choice might be the solution with the best average of classification error and number of connections.
In the experimental analysis, the best exploration of the problem search space was obtained with the weight-decay cost function. This method generated solutions with low-complexity topologies and few errors. The superiority of the method was statistically verified in the Mackey-Glass, Diabetes and Artificial Nose data sets.
GaTSa - Feature Subset Selection. Figure 2 displays the results of the k-NN classifier (for the classical methods) and the MLP (for the GaTSa method) on datasets with all attributes. The labels Attrib. and Class. denote the number of features selected by the technique and the classification performance, respectively. The figure displays the best results of the average performance - number of attributes following feature selection and classification performance - of the methods, considering the results of each feature selection technique with the Forward, Backward and Random search strategies.
In experiments, the performance of the algorithms
was obtained in a 10-fold cross-validation process,
the same method used to evaluate the precision of the
classifier with all features. The results of the random
strategy correspond to the average of 10 runs, due to
the random characteristics of this model.
Figure 2: Feature selection performance.
Table 3: GaTSa experiment configuration.
The Best-First method obtained interesting results regarding feature selection. The k-NN classifier also exhibited good performance. In the Thyroid data set, k-NN obtained better results than the fully connected MLP with all attributes (the MLP obtained a 7.38% classification error). The characteristics of the dataset can explain these results: this database has a non-balanced data distribution, with class probability distributions of 5.1%, 92.6% and 2.3%, respectively.
Unlike in the previous experiments, the Hill-Climbing Forward variant obtained worse results than the Backward and Random variants. A characteristic of the Forward strategy is the evaluation of a small number of subsets. This tendency suggests a difficulty of the method in escaping from local minima.
The superiority of the Random Bit Climber over the Hill-Climbing method may demonstrate its lesser tendency to become trapped in local minima. The Las Vegas algorithm uses brute force, creating random subset solutions without any heuristics to optimize the process. With this search strategy, the number of evaluations is high, even when the search space is small. Nevertheless, random algorithms have found interesting and relatively simple solutions to search problems.
The simulations carried out with the Las Vegas and Best-First (Random) algorithms obtained the best results in the Fisher Iris data set, where only one attribute was selected. In the Thyroid and Diabetes data sets, the best algorithms were Hill-Climbing and Random Bit Climber (Random), with two and five attributes selected, respectively. The classification performance of the methods was similar in practically all experiments. The classification degradation relative to the simulations with all attributes was low.
The proposed GaTSa method obtained interesting results in relevant feature selection, with results very close to the classical feature selection methods in the investigated problems. Despite not being specifically developed for relevant feature selection, the GaTSa method, even though in an indirect way, is able to eliminate features that are not relevant to the learning algorithms.
Although the method showed promising results, the model had difficulty achieving good performance when missing information is present in the database and the classification task cannot be performed with low error. This characteristic was observed in some runs of the algorithm, in which the network did not obtain good classification performance. This occurs because the elimination of connections does not take into consideration each input node of the network. The elimination of an input node only happens if all connections attached to this node are eliminated. The deselection of a feature occurs as a consequence of the connection reduction process. The proposed method does not verify the contribution of each input feature during the optimization, because the emphasis is on the contribution of each network connection. If the focus of the optimization is the selection of relevant features, the designer could consider other characteristics in the cost function evaluation beyond the classification error and the number of connections.
Design of Experiments. Design of experiments (Montgomery, 2001) was applied in order to determine the factors with the greatest influence on the system's performance. When analyzing the influence of each of these parameters, the designer should pay most attention to the ones presenting values that are statistically most significant. Thus, it should be possible to avoid the necessity for a detailed analysis of different configurations that might, in fact, lead to the design of various models with very similar behavior patterns. We expected that a small number of parameters would have a great influence on the model's average performance.
In the study performed with GaTSa, we opted to carry out a factorial experiment with two levels (a 2^k factorial experiment), seeking to reduce the number of experiments required. Table 3 presents the controlled factors.
The analyses were carried out in a randomized manner. Seven control factors (variables) were considered, each with two levels, resulting in 128 combinations. Each combination of the levels of the control factors was run five times, totaling 640 analyses. Due to the random characteristics of the model, each of the 640 analyses is the average of 30 algorithm runs, so that 19,200 simulations were performed with the proposed method.
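The effect estimates behind such an analysis can be sketched as follows (hypothetical coding and synthetic data; a full ANOVA as in Montgomery (2001) would add F tests):

```python
import numpy as np
from itertools import product

def main_effects(levels, response):
    """levels: (n_runs x k) design matrix coded -1/+1; response: the
    measured performance per run. The main effect of factor j is the
    mean response difference between its high and low levels."""
    levels, response = np.asarray(levels), np.asarray(response)
    return np.array([response[levels[:, j] == 1].mean()
                     - response[levels[:, j] == -1].mean()
                     for j in range(levels.shape[1])])

# Toy 2^3 design: factor 0 dominates the (synthetic) response.
design = np.array(list(product([-1, 1], repeat=3)))
rng = np.random.default_rng(0)
y = 3.0 * design[:, 0] + rng.normal(scale=0.1, size=len(design))
print(main_effects(design, y))  # first effect ~ 6, others ~ 0
```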
Through the analysis of variance of the factorial experiment, considering a statistical significance level of 5% in the F distribution, two factors were identified as having the largest influence on the performance of the MLP optimized by the proposed model.
The most relevant factors are: the number of micro-evolutions in the genetic operators (D), corresponding to 32.64% of the system variance; and the genetic operators' crossover rate (E), corresponding to 24.04% of the variance. Interactions (variation among the differences between means for different levels of one factor over different levels of the other factor) were also identified between the factors: number of micro-evolutions in the genetic operators (D) and genetic operators' crossover rate (E), corresponding to 10.39% of the system variance; and neighborhood size (A) and tabu list size (G), corresponding to 1.82% of the total data variance.
It is interesting to note that despite the large number of configurable parameters of the method, just a few have a significant influence on the performance of the optimized artificial neural networks. This is an interesting characteristic because even inexperienced designers can use the method successfully. The parameters most influential on the method's performance were the variables that control the evolution of the solutions in the search space. The method is robust to the other settings, as varying them did not significantly change the performance of the generated networks.
5 FINAL REMARKS
The proposed method has been developed as a general iterative algorithm. Thus, it can easily be adapted to any combinatorial optimization problem. The performance of the proposed technique is evaluated in the simultaneous optimization of the number of connections and the connection weight values among the processing units of an MLP. It is important to remember that the problem investigated involves a critical subject: the stability versus plasticity relation in the training of artificial neural networks.
The performance of different cost functions in MLP training was also evaluated. Of the five cost functions evaluated, the best performance was obtained by the weight-decay approach in the investigated benchmarks, which were composed of four classification problems and one prediction problem.
Experiments also demonstrated that the method can be used for relevant feature selection. During the network topology optimization, input processing units may be eliminated in accordance with the performance of the method. Thus, a feature selection is obtained that is inherent to the artificial neural network optimization process. In simulations, we compared the proposed method with the Hill-Climbing, Random Bit Climber, Best-First and Las Vegas methods in three classification problems.
A study of the different parameters involved in the design of the proposed model was also presented. Design of experiments was used to analyze and compare experiments by describing the statistical interactions and interrelations between the parameters of the proposed work. The experiments indicate that the most relevant parameters are the number of micro-evolutions in the genetic operators and the genetic operators' crossover rate.
Considering the data sets used in this work, the method was able to automatically generate MLP topologies with far fewer connections than the maximum number allowed. The results also yield interesting conclusions about the importance of each input feature in the classification and prediction tasks.
REFERENCES
Anderson, E. (1953). The irises of the Gaspé Peninsula. Bulletin of the American Iris Society, 59:2–5.
Blake, C. L. and Merz, C. J. (1998). UCI Repository of Machine Learning Databases. Technical report, University of California, Irvine, CA.
de Souza, J., Neto, B., dos Santos, F., de Melo, C., Santos, M. S., and Ludermir, T. (1999). Polypyrrole based aroma sensor. Synthetic Metals, 102:1296–1299.
Mackey, M. C. and Glass, L. (1977). Oscillation and chaos in physiological control systems. Science, 197:287–289.
Montgomery, D. C. (2001). Design and Analysis of Exper-
iments (Fifth Edition). John Wiley and Sons.
Quinlan, J. (1987). Simplifying decision trees. Int. Journal of Man-Machine Studies, 27:221–234.
Zanchettin, C. and Ludermir, T. B. (2006). A methodology
to train and improve artificial neural networks weights
and connections. In Int. Joint Conf. on Neural Net-
works, pages 5267–5274. IEEE Computer Society.
The authors would like to thank CNPq (Brazilian research agency) for its financial support.