
 
problem. Each particle is evaluated using the 
following equation: 
J_e = \frac{1}{N_c} \sum_{j=1}^{N_c} \left[ \sum_{\forall Z_p \in C_{ij}} d(Z_p, m_{ij}) \, / \, |C_{ij}| \right]    (F1)
 
where Z_p denotes the p-th data vector, |C_ij| is the number of data vectors belonging to the cluster C_ij, and d is the Euclidean distance between Z_p and m_ij.
3.1  The Evaluation Function 
The evaluation function plays a fundamental role in any evolutionary algorithm: it tells how good a solution is.
By analyzing equation F1 we can see that it first takes each cluster C_ij and calculates the average distance between the data vectors belonging to the cluster and its centroid m_ij. It then takes the average distances of all clusters C_ij and calculates another average, which is the result of the equation.
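As a concrete illustration, here is a minimal Python sketch of equation F1; the names (fitness_f1, clusters, centroids) are ours, not the paper's, and every cluster is assumed to be a non-empty NumPy array of data vectors:

```python
import numpy as np

def fitness_f1(clusters, centroids):
    # Per-cluster average Euclidean distance: sum of d(Z_p, m_ij)
    # over the cluster, divided by |C_ij| (the array's row count).
    per_cluster = [np.linalg.norm(c - m, axis=1).mean()
                   for c, m in zip(clusters, centroids)]
    # Second average, over the N_c clusters, gives J_e.
    return float(np.mean(per_cluster))
```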
It can be seen that a cluster C_ij with just one data vector influences the final result (the quality) as much as a cluster C_ik with many data vectors.
Sometimes a particle that does not represent a good solution is evaluated as if it did. For instance, suppose that one of the particle's clusters has a single data vector very close to its centroid, while another cluster has many data vectors that are not so close to their centroid. This is not a very good solution, but giving the same weight to the cluster with one data vector as to the cluster with many data vectors can make it seem to be one. Furthermore, this equation does not reward homogeneous solutions, that is, solutions where the data vectors are well distributed among the clusters.
To solve this problem we propose the following 
new equations, where the number of data vectors 
belonging to each cluster is taken into account: 
F = \sum_{j=1}^{N_c} \left\{ \left[ \sum_{\forall Z_p \in C_{ij}} d(Z_p, m_{ij}) \, / \, |C_{ij}| \right] \times \left( |C_{ij}| \, / \, N_o \right) \right\}    (F2)
 
where N_o is the number of data vectors to be clustered.
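Under the same assumptions as the sketch above, F2 changes only the weighting: each cluster's average distance is multiplied by its share |C_ij| / N_o of the data instead of by 1 / N_c:

```python
import numpy as np

def fitness_f2(clusters, centroids):
    n_o = sum(len(c) for c in clusters)  # total number of data vectors, N_o
    # Each cluster's average distance, weighted by its size share.
    return float(sum(np.linalg.norm(c - m, axis=1).mean() * (len(c) / n_o)
                     for c, m in zip(clusters, centroids)))
```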
To take into account the distribution of the data 
among the clusters, the equation can be changed to: 
F' = F \times \left( |C_{ik}| - |C_{il}| + 1 \right)    (F3)

such that

|C_{ik}| = \max_{\forall j = 1, \dots, N_c} \{ |C_{ij}| \}   and   |C_{il}| = \min_{\forall j = 1, \dots, N_c} \{ |C_{ij}| \}
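A sketch of F3 on top of the F2 sketch above: since the fitness is being minimised, scaling it by the gap between the largest and smallest cluster penalises uneven distributions.

```python
def fitness_f3(clusters, centroids):
    sizes = [len(c) for c in clusters]  # |C_ij| for each cluster
    # (|C_ik| - |C_il| + 1): largest cluster size minus smallest, plus one.
    return fitness_f2(clusters, centroids) * (max(sizes) - min(sizes) + 1)
```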
The next section shows the test results with these 
different equations. 
4 RESULTS 
Table 1 shows the three benchmarks that were used: Iris, Wine and Glass, taken from the UCI Repository of Machine Learning Databases (Assuncion, 2007).
Table 1: Benchmark features.

Benchmark   Number of Objects   Number of Attributes   Number of Classes
Iris               150                   4                     3
Wine               178                  13                     3
Glass              214                   9                     7
For each data set, three implementations, using the equations F1, F2 and F3, were run 30 times, with 200 function evaluations and 10 particles, using w = 0.72, c1 = 1.49 and c2 = 1.49 (Merwe, 2003).
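For reference, these parameters enter the standard gbest PSO velocity and position update; the formulation below is the conventional one, with the usual symbols, and is not reproduced from this section:

v_i(t+1) = w \, v_i(t) + c_1 r_1 \left[ y_i(t) - x_i(t) \right] + c_2 r_2 \left[ \hat{y}(t) - x_i(t) \right]
x_i(t+1) = x_i(t) + v_i(t+1)

where y_i is the particle's personal best position, \hat{y} is the global best, and r_1, r_2 are random numbers drawn uniformly from [0, 1].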
Each benchmark class is represented by the particle-created cluster with the largest number of data vectors of that class; data vectors of other classes within this cluster are considered misclassified. The hit rate of the algorithm can thus be easily calculated.
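A minimal sketch of that hit-rate computation, assuming labels holds the true class of each data vector and assignments the cluster index a particle placed it in (both names are ours):

```python
from collections import Counter

def hit_rate(labels, assignments):
    hits = 0
    for cls in set(labels):
        # The cluster holding the largest number of this class's vectors
        # represents the class; those vectors count as correctly grouped.
        counts = Counter(a for a, l in zip(assignments, labels) if l == cls)
        hits += max(counts.values())
    return hits / len(labels)
```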
The average hit rate t over the 30 simulations ± 
the standard deviation σ of each implementation is 
presented in Table 2. 
As can be seen in Table 2, the changes to the fitness function brought clear improvements to the results on the evaluated benchmarks. It is important to notice that equation F3 pushes the particles towards clusters with more uniformly distributed data, so it should be used on problems in which the clusters are known beforehand to have uniform sizes; otherwise, equation F2 should be used. On Iris, in which the clusters have uniform sizes, equation F3 produced very good results, although equation F2 also produced good results. The improvements on the other benchmarks are also satisfactory.
Figure 1 shows the convergence of the three functions. As is characteristic of PSO, all of them converge quickly.
Figures 2, 3 and 4 show some examples of the clusterings found for the Iris benchmark. In Figure 2, the algorithm using function F1 found the correct group for 71.9% of the data; in Figure 3, F2 found the correct group for 88.6%; and in Figure 4, F3 found the correct group for 85.3%. It can be seen that F2 and F3 completely separated the class setosa (squares) from the other classes.