Weighted Evidence Accumulation Clustering Using Subsampling

F. Jorge F. Duarte¹, Ana L. N. Fred², Fátima Rodrigues¹, João M. M. Duarte¹ and André Lourenço

¹ GECAD – Knowledge Engineering and Decision Support Group, Instituto Superior de Engenharia do Porto, Instituto Superior Politécnico, Porto, Portugal

² Instituto de Telecomunicações, Instituto Superior Técnico, Lisboa, Portugal

Abstract. We introduce an approach based on evidence accumulation (EAC) for combining partitions in a clustering ensemble. EAC uses a voting mechanism to produce a co-association matrix based on the pairwise associations obtained from N partitions, where each partition has equal weight in the combination process. By applying a clustering algorithm to this co-association matrix we obtain the final data partition. In this paper we propose a clustering ensemble combination approach that uses subsampling and that weights the partitions differently (WEACS). We use two ways of weighting each partition: SWEACS, using a single validation index, and JWEACS, using a committee of indices. We compare combination results with the EAC technique and the HGPA, MCLA and CSPA methods by Strehl and Ghosh using subsampling, and conclude that the WEACS approaches generally obtain better results. As a complementary step to the WEACS approach, we combine all the final data partitions produced by the different variations of the method and use the Ward Link algorithm to obtain the final data partition.

1 Introduction

Clustering is a procedure of partitioning data into groups or clusters based on a concept of proximity or similarity between data. There is a huge number of clustering algorithms, yet no single algorithm can successfully discover by itself all types of cluster shapes and structures. Recently, clustering ensemble approaches were introduced [1-7,22-28] based on the idea of combining the partitions of a cluster ensemble into a final data partition.

Jorge F. Duarte F., L. N. Fred A., Rodrigues F., M. M. Duarte J. and Lourenço A. (2006). Weighted Evidence Accumulation Clustering Using Subsampling. In 6th International Workshop on Pattern Recognition in Information Systems, pages 104-116. DOI: 10.5220/0002504501040116. Copyright © SciTePress

The concept underlying the EAC method, by Fred and Jain, is to combine the results of a cluster ensemble into a single combined final data partition, considering each clustering result as independent evidence of data organization. Using a voting mechanism and taking the pairwise associations as votes, the N data partitions of n patterns are mapped into an n × n co-association matrix:

$$co\_assoc(i,j) = \frac{votes_{ij}}{N} \qquad (1)$$

where $votes_{ij}$ is the number of times the pattern pair (i,j) is assigned to the same cluster among the N clusterings. The final data partition (P*) is obtained by applying a clustering algorithm to the co-association matrix. The final number of clusters can be fixed or automatically chosen using lifetime criteria [2,3].
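The voting scheme of Eq. (1) can be illustrated with a minimal sketch. It assumes that every partition is given as a vector of integer labels over the same n patterns; the function name `co_association` and the toy ensemble are illustrative and not taken from the paper.

```python
import numpy as np

def co_association(partitions):
    """Fraction of partitions in which each pair of patterns shares a cluster."""
    partitions = np.asarray(partitions)      # shape (N, n): one label vector per partition
    n_partitions, n_patterns = partitions.shape
    votes = np.zeros((n_patterns, n_patterns))
    for labels in partitions:
        # same[i, j] is True when patterns i and j carry the same label
        same = labels[:, None] == labels[None, :]
        votes += same
    return votes / n_partitions

# Three toy partitions of four patterns (label values are arbitrary).
ensemble = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
co_assoc = co_association(ensemble)
```

In this toy ensemble, patterns 0 and 1 are always grouped together, so their entry is 1, while patterns 0 and 3 never are, so their entry is 0.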

Strehl and Ghosh explored graph theoretical concepts in the combination of clustering ensembles. The partitions included in the clustering ensemble are mapped into a hypergraph, where vertices correspond to samples, and partitions correspond to hyperedges. They proposed three heuristics to try to answer the combination problem: the hypergraph-partition algorithm (HGPA), the meta clustering algorithm (MCLA) and the cluster-based similarity partitioning algorithm (CSPA).

Duarte et al. proposed the WEAC approach [4,5], also based on evidence accumulation clustering. WEAC uses a weighted voting mechanism to integrate the partitions of the clustering ensemble in a weighted co-association matrix. Two different methods are followed: SWEAC, where each clustering is evaluated by a relative or internal cluster validity index and the contribution of each clustering is weighted by the value achieved for this index; and JWEAC, where each clustering is evaluated by a set of relative and internal cluster validity indices and the contribution of each clustering is weighted by the overall results achieved with these indices. The final data partition is obtained by applying a clustering algorithm to the weighted co-association matrix.

In this paper we test how subsampling techniques influence the combination results using the WEAC approach (WEAC with subsampling, WEACS). Partitions in the ensemble are generated by clustering subsamples of the data set. Each subsample has 80% of the elements of the data set. As with the WEAC approach, two different methods are used to weight data partitions in the co-association matrix (w_co_assoc matrix): Single Weighted EAC with subsampling (SWEACS) and Joint Weighted EAC with subsampling (JWEACS).

We assessed experimentally the performance of the WEACS approach and compared it with the single application of the Single Link, Complete Link, Average Link, K-means and Clarans algorithms and with the subsampling versions of the EAC, HGPA, MCLA and CSPA methods.

Section 2 summarizes the cluster validity indices used in WEACS. Section 3 presents the Weighted Evidence Accumulation Clustering with subsampling (WEACS) and the experimental setup used. In section 4 synthetic and real data sets are used to assess the performance of WEACS. Finally, in section 5 we present the conclusions.

2 Cluster Validity Indices

Cluster validity indices address the following two important questions associated with any clustering: how many clusters are present in the data, and how good the clustering itself is. For a summary of cluster validity measures and comparative studies see for example [8,9] and the references therein.

We can use three approaches to assess cluster validity [10]: external validity indices assess the clustering results based on a structure that is assumed on the data set (ground truth); internal validity indices assess the clustering results in terms of quantities that involve the vectors of the data set themselves; and relative validity indices assess a clustering result by comparing it with other clustering results, obtained by the same algorithm but with different input parameters.

In this work, we employed a set of widely used and referenced internal and relative cluster validity indices to evaluate the quality of the clusterings to be included and weighted in the w_co_assoc matrix. We used two internal indices, the Hubert Statistic and Normalized Hubert Statistic (NormHub) [11], and fourteen relative indices: Dunn index [12], Davies-Bouldin index (DB) [13], Root-mean-square standard deviation (RMSSDT) [14], R-squared index (RS) [14], the SD validity index [9], the S_Dbw validity index [9], Calinski & Harabasz cluster validity index (CH) [15], Silhouette statistic (S) [16], index I [17], XB cluster validity index [18], Squared Error index (SE), Krzanowski & Lai (KL) cluster validity index [19], Hartigan cluster validity index (H) [20] and the Point Symmetry index (PS) [21].

3 Weighted Evidence Accumulation Clustering Using Subsampling (WEACS)

The WEACS approach extends the WEAC approach [4,5] by using subsampling in the construction of the cluster ensemble. Both methods extend the EAC technique by weighting data partitions differently in the combination process according to cluster validity indices. The use of subsampling in WEACS has two main reasons: to create diversity in the cluster ensemble and to test the robustness of the method. In fact, other works have shown that the use of subsampling increases diversity in the cluster ensemble, leading to more robust solutions [22,24,26].

As in WEAC, WEACS proposes the evaluation of the quality of each data partition by one or more cluster validity indices, which ultimately determines its weight in the combination process. A simple voting mechanism can produce poor clustering results if a set of poor clusterings overshadows an isolated good clustering. By weighting the partitions in the weighted co-association matrix according to the evaluation made by cluster validity indices, and by assigning higher relevance to better partitions in the clustering ensemble, we expect to achieve better combination results.

Considering n the number of patterns in a data set and given a clustering ensemble $P = \{P^1, P^2, \ldots, P^N\}$ with N partitions of n*0.8 patterns produced by clustering subsamples of the data set, and a corresponding set of normalized indices with values in the interval [0,1] measuring the quality of each of these partitions, the clustering ensemble is mapped into a weighted co-association matrix:

$$w\_co\_assoc(i,j) = \sum_{L=1}^{N} \frac{vote_{Lij} \cdot VI^{L}}{S(i,j)} \qquad (2)$$

where N is the number of clusterings, $vote_{Lij}$ is a binary value (1 or 0) indicating whether the object pair (i,j) has co-occurred in the same cluster in the L-th partition, $VI^{L}$ is the normalized cluster validity index value for the L-th partition, and $S$ is a matrix whose (i,j)-th entry is equal to the number of data partitions, from the total N data partitions, where both patterns i and j are simultaneously present. The final data partition is obtained by applying a clustering algorithm to the weighted co-association matrix. The proposed WEACS method is schematically described in Table 1.

Table 1. WEACS approach.

Input:

n – number of data patterns of the data set

$P = \{P^1, P^2, \ldots, P^N\}$ – clustering ensemble with N data partitions of n*0.8 patterns produced by clustering subsamples of the data set

$VI = \{VI^1, VI^2, \ldots, VI^N\}$ – normalized cluster validity index values of the corresponding data partitions

Output: Final combined data partitioning.

Initialization: set w_co_assoc to a null n × n matrix.

1. For L = 1 to N

Update the w_co_assoc: for each pattern pair (i,j) in the same cluster, set

$$w\_co\_assoc(i,j) = w\_co\_assoc(i,j) + \frac{vote_{Lij} \cdot VI^{L}}{S(i,j)}$$

$vote_{Lij}$ – binary value (1 or 0), depending on whether the object pair (i,j) has co-occurred in the same cluster (or not) in the L-th partition

$VI^{L}$ – the normalized cluster validity index value for the L-th partition

$S(i,j)$ – number of data partitions where patterns i and j are both present

2. Apply a clustering algorithm to the w_co_assoc matrix to obtain the final data partition
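The accumulation in Table 1 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: each partition is represented by the indices of the patterns in its subsample plus one label per sampled pattern, the division by S(i,j) is applied once after all votes are accumulated (which gives the same sum as Eq. 2), and pairs that never appear together are left at 0. The function name and the toy data are hypothetical.

```python
import numpy as np

def weighted_co_association(n_patterns, subsamples, labelings, weights):
    """Accumulate validity-weighted votes over partitions of subsamples.

    subsamples[L] : indices of the patterns present in the L-th subsample
    labelings[L]  : cluster label of each of those patterns, in the same order
    weights[L]    : normalized validity index value of the L-th partition, in [0, 1]
    """
    weighted_votes = np.zeros((n_patterns, n_patterns))
    presence = np.zeros((n_patterns, n_patterns))      # plays the role of S(i, j)
    for idx, labels, vi in zip(subsamples, labelings, weights):
        idx = np.asarray(idx)
        labels = np.asarray(labels)
        same = (labels[:, None] == labels[None, :]).astype(float)
        weighted_votes[np.ix_(idx, idx)] += vi * same
        presence[np.ix_(idx, idx)] += 1.0
    # Assumption of this sketch: pairs never sampled together keep the value 0.
    return np.divide(weighted_votes, presence,
                     out=np.zeros_like(weighted_votes), where=presence > 0)

# Toy example: 4 patterns, 3 subsamples of 3 patterns each.
subsamples = [[0, 1, 2], [1, 2, 3], [0, 1, 3]]
labelings = [[0, 0, 1], [0, 0, 1], [0, 1, 1]]
weights = [1.0, 0.5, 0.8]
w_co_assoc = weighted_co_association(4, subsamples, labelings, weights)
```

Patterns 0 and 1 appear together in two subsamples but share a cluster only in the first one (weight 1.0), so their entry is 1.0 / 2 = 0.5.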

In WEACS we used two different ways of weighting each data partition:

1. Single Weighted EAC with subsampling (SWEACS): in this method, the quality of each data partition is evaluated by a single normalized relative or internal cluster validity index, and each vote in the w_co_assoc matrix is weighted by the value of this index:

$$VI^{L} = norm\_validity(P^{L}) \qquad (3)$$

2. Joint Weighted EAC with subsampling (JWEACS): in this method, the quality of each data partition is evaluated by a set of relative and internal cluster validity indices, and each vote in the w_co_assoc matrix is weighted by the overall contributions of these indices:

$$VI^{L} = \sum_{ind=1}^{NInd} \frac{norm\_validity_{ind}(P^{L})}{NInd} \qquad (4)$$

where $NInd$ is the number of cluster validity indices used, and $norm\_validity_{ind}(P^{L})$ is the value of the ind-th validity index over the partition $P^{L}$. We used sixteen cluster validity indices in our experiments.
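As a small worked illustration of Eqs. (3) and (4), the sketch below computes both weights for one partition. The index names and scores are hypothetical placeholders, not values from the experiments.

```python
def sweacs_weight(normalized_index_value):
    """Eq. (3): the weight is the value of a single normalized validity index."""
    return normalized_index_value

def jweacs_weight(normalized_index_values):
    """Eq. (4): the weight is the mean of the NInd normalized index values."""
    return sum(normalized_index_values) / len(normalized_index_values)

# Hypothetical normalized scores of one partition under three indices.
scores = {"index_a": 0.9, "index_b": 0.6, "index_c": 0.75}
vi_single = sweacs_weight(scores["index_a"])      # uses one index only
vi_joint = jweacs_weight(list(scores.values()))   # (0.9 + 0.6 + 0.75) / 3
```

With these placeholder scores the joint weight is 0.75, lower than the single-index weight of 0.9, because the other two indices rate the partition less favourably.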

In the WEACS approach we can use different clustering ensemble construction methods, different clustering methods to obtain the final data partition and, particularly in the SWEACS version, different cluster validity indices to weight the data partitions. These constitute variations of the approach, taking each of the possible modifications as a configuration parameter of the method. As shown in the experimental results section, although WEACS leads in general to good results, no individual configuration tested led consistently to the best results in all data sets as compared to the subsampling versions of the EAC, HGPA, MCLA and CSPA methods. Strehl and Ghosh [6] proposed to use the average normalized mutual information (ANMI) as a criterion for selecting among the results produced by different strategies. The “best” solution is chosen as the one that has maximum average mutual information with all individual partitions of the clustering ensemble. By comparing the best results according to the consistency index with ground truth information ($P^0$), $C_i(P^*, P^0)$, with the corresponding consensus values (ANMI), it was shown in [28], and we could confirm in this work, that there is no correlation between these two measures; the mutual information based consensus function is therefore not suitable for the selection of the best performing method.

To solve this problem we use a complementary step to the WEACS approach. It consists of combining all the final data partitions obtained in the WEACS approach with a single clustering ensemble construction method, or of combining all the final data partitions obtained in the WEACS approach with all clustering ensemble construction methods. These data partitions are combined using the EAC approach and the final data partition (P*) is obtained by applying a clustering algorithm to this new co-association matrix.

3.1 Experimental Setup

3.1.1 Generation of Clustering Ensembles

There are several different approaches to produce clustering ensembles. We produced clustering ensembles using a single algorithm (Single Link (SL), Complete-Link (CL), Average-Link (AL), K-means (KM) and Clarans (CLR)) with different parameter values and/or initializations, and using diverse clustering algorithms with diverse parameter values and/or initializations. Specifically, each clustering algorithm makes use of multiple values of k, and K-means and Clarans in addition make use of multiple initializations of cluster centers. We also investigated a clustering ensemble that includes all the partitions generated by all the clustering algorithms (ALL).

3.1.2 Normalization of Cluster Validity Indices

We can find two types of indices: some are intrinsically normalized and others are not. In this work we use two intrinsically normalized indices and fourteen that are not. The Normalized Hubert Statistic and the Silhouette index are normalized in [-1,1], but we only consider values in [0,1]. We use two internal validity indices and fourteen relative validity indices. The best result for some indices is the highest value and for others the lowest value. When the indices of the first type only take values greater than zero, the normalization is made by dividing the value obtained for the index by the maximum value obtained over all partitions (index_value = value_obtained / Maximum_value). When the indices of the second type only take values greater than zero, the normalization is made by dividing the minimum value obtained over all partitions by the value obtained for the index in the partition (index_value = Minimum_value / value_obtained).

Some other indices increase (or decrease) as the number of clusters increases, and it is impossible to find either the maximum or the minimum. With these indices, we look for the value of k where the major local variation in the value of the index happens. This variation appears as a “knee” in the plot and corresponds to the number of clusters existing in the data set. The best value of this kind of index is typically not the highest (or lowest) value achieved but the one where the “knee” appears, so these indices cannot be incorporated directly in the w_co_assoc matrix. The value 1 is given to the partition corresponding to the “knee” in the index. To incorporate these indices in the co-association matrix we adopted the following approach: run the clustering algorithms varying the number of clusters in $[1, k_{maximum}]$, where $k_{maximum}$ is the maximum number of clusters we suppose to exist in the data set; then compare the partition corresponding to the “knee” with each of the other partitions generated by this algorithm. We used an external index, the Consistency index ($C_i$) proposed in [1], to compare these clusterings. We applied this approach to the Hubert Statistic, RMSSDT index, RS index and Squared Error index.

The expected number of clusters according to the Hartigan cluster validity index is the smallest k >= 1 such that H(k) <= 10. Given that the Hartigan index is not calculated for values of k greater than the expected number of clusters (it typically achieves negative values), we apply to this index the same procedure used for the indices based on the “knee” to obtain an index value for partitions with k greater than the expected number of clusters. Table 2 shows the criteria to achieve the best value with each validity index.

Table 2. Criteria to obtain the best value according to each validity index.

| Index | Criteria | Index | Criteria | Index | Criteria | Index | Criteria |
|---|---|---|---|---|---|---|---|
| Hubert | “Knee” | RMSSDT | “Knee” | CH | Max | SE | “Knee” |
| NormHub | Max | RS | “Knee” | S | Max | KL | Max |
| Dunn | Max | SD | Min | I | Max | H | Smallest k: H(k)<=10 |
| DB | Min | S_Dbw | Min | XB | Min | PS | Min |
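The two simplest normalization rules described above (for the “Max” and “Min” criteria with strictly positive index values) can be sketched as below. The function names and raw values are illustrative assumptions; the “knee”-based indices and the Hartigan index require the comparison against the knee partition and are not covered by this sketch.

```python
def normalize_max_is_best(values):
    """Higher is better: divide each value by the maximum over all partitions."""
    top = max(values)
    return [v / top for v in values]

def normalize_min_is_best(values):
    """Lower is better: divide the minimum over all partitions by each value."""
    low = min(values)
    return [low / v for v in values]

# Hypothetical raw index values for three partitions (all strictly positive).
raw = [2.0, 4.0, 8.0]
as_max_index = normalize_max_is_best(raw)   # best partition is the one scoring 8.0
as_min_index = normalize_min_is_best(raw)   # best partition is the one scoring 2.0
```

In both cases the best partition receives the value 1 and every other partition a value in (0, 1), so the result can be used directly as a vote weight.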

3.1.3 Extraction of the Final Combined Data Partition

The w_co_assoc matrix can be seen as a new similarity matrix between patterns; we therefore apply a clustering algorithm to it to obtain the final combined data partition P*. In our experiments, we assumed that the final number of clusters is known and we used the K-means, SL, AL and Ward’s link (WR) algorithms to obtain the final partition. To assess the performance of the combination methods, we compare the final data partitions with ground truth information, using the Consistency index ($C_i$) to compare these partitions.

4 Experimental Results

4.1 Data Sets

Synthetic Data Sets. For simplicity of visualization we considered 2-dimensional patterns. These data sets were produced to evaluate the performance of WEACS in a multiplicity of conditions, such as distinct data sparseness in the feature space, arbitrarily shaped clusters, and well separated and touching clusters. Figure 1 plots these data sets.

Fig. 1. Synthetic data sets: (a) Bars, (b) Cigar, (c) Half Rings, (d) Spiral.

The Bars data set has 2 classes (200 and 200 patterns), with the density of the patterns increasing with increasing horizontal coordinate. The Cigar data set has 4 classes (100, 100, 25 and 25). The Half Rings data set is composed of 3 uniformly distributed classes (150, 150 and 200) within half-ring envelopes. The Spiral data set consists of 200 samples divided evenly in 2 classes.

Real Data Sets. Four real-life data sets were considered to show the performance of WEACS: Breast Cancer, Iris, DNA microarrays and Handwritten Digits. The Breast Cancer data set (http://www.ics.uci.edu/~mlearn/MLRepository.html) has 683 samples (9 features) split in two classes: Benign and Malignant. The Iris data set is divided in three types of Iris plants (50 samples per class), characterized by 4 features, with one class well separated from the other two, which are intermingled. The Yeast Cell data set (DNA microarrays) consists of the fluctuations of the gene expression levels of over 6000 genes over two cell cycles. The available data set is restricted to the 384 genes with 17 features (http://staff.washington.edu/kayee/model/) whose expression levels peak at different time points corresponding to the 5 phases of the cell cycle. We used the logarithm of the expression level (Log Yeast) and a “standardized” version (Std Yeast) of the data (with mean 0 and variance 1). The Handwritten Digits data set is available at the UCI repository (http://www.ics.uci.edu/~mlearn/MLRepository.html) and consists of 3823 training samples, each with 64 features. A subset (Optical) composed of the first 100 samples of all the digits was used.

4.2 Combination of Clustering Ensembles Using WEACS

The quality of the final data partition, P*, obtained with the WEACS method is evaluated by calculating the consistency of P* with ground truth information $P^0$, using the Consistency index $C_i(P^*, P^0)$. We assume that the true number of clusters is known and use it as the number of clusters in P*.

Using subsamples of a data set (80% of the number of patterns in the data set), we applied each of the clustering ensemble construction methods (SL, AL, CL, KM and CLR) to generate 50 clustering ensembles, each with 100 partitions with k randomly chosen in the set {10,…,30}. Then, we applied the EAC, HGPA, MCLA, CSPA and WEACS approaches to each of these clustering ensembles. Finally, we calculated the average results over the 50 runs. Due to space limitations, it was not possible to present results of the application of the subsampling version of the EAC and WEACS approaches to all data sets. As an example, in Table 3 we present the $C_i(P^*, P^0)$ index values for the SL, AL, CL, Clarans, K-means and ALL clustering ensembles with the Std Yeast data set. In this table, rows are grouped by clustering ensemble construction method. Inside each clustering ensemble construction method appear the four clustering methods (K-means, SL, AL and WR) used to extract the final data partition. The ALL cluster ensemble construction method gathers all the partitions produced by all the methods (N=500).
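The sampling part of this protocol can be sketched as follows. The sketch only draws the subsample and the value of k for each partition; the clustering step itself is omitted. Sampling without replacement, drawing k uniformly, the rounding of the subsample size and the fixed seed are assumptions of this illustration, and the function name is hypothetical.

```python
import random

def draw_ensemble_settings(n_patterns, n_partitions=100, fraction=0.8,
                           k_min=10, k_max=30, seed=0):
    """Return one (subsample indices, k) pair per partition of the ensemble."""
    rng = random.Random(seed)
    subsample_size = int(round(fraction * n_patterns))
    settings = []
    for _ in range(n_partitions):
        # Assumption: patterns are drawn without replacement.
        indices = sorted(rng.sample(range(n_patterns), subsample_size))
        # Assumption: k is drawn uniformly from {k_min, ..., k_max}, both ends included.
        k = rng.randint(k_min, k_max)
        settings.append((indices, k))
    return settings

# Example with the size of the Yeast data set (384 patterns).
settings = draw_ensemble_settings(n_patterns=384)
```

Each of the 100 settings would then be passed to one of the base algorithms (SL, AL, CL, KM or CLR) to produce one partition of the ensemble.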

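Table 3. $C_i(P^*, P^0)$ index values with the Std Yeast data set. Rows are grouped in blocks of four (one block per clustering ensemble construction method); the block labels appear in the source as CLR, ALL, SL, AL, CL, KM, detached from the rows, so the blocks are numbered here in order of appearance.

| Block | Extraction | EAC | JWEACS | Hubert | Nhubert | Dunn | RMSSDT | RS | S_Dbw | CH | S | index_I | XB | SE | DB | SD | H | KL | PS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | KM | 31.66 | 31.43 | 30.19 | 31.01 | 31.73 | 31.28 | 31.19 | 30.04 | 27.90 | 35.16 | 31.29 | 29.82 | 31.58 | 29.96 | 31.16 | 30.44 | 31.65 | 29.48 |
| 1 | SL | 35.93 | 36.17 | 35.96 | 35.70 | 35.93 | 35.96 | 35.96 | 35.93 | 35.96 | 35.42 | 35.69 | 35.95 | 35.96 | 35.93 | 35.70 | 36.18 | 35.71 | 35.69 |
| 1 | AL | 36.18 | 36.23 | 35.71 | 35.94 | 35.72 | 35.71 | 35.71 | 35.94 | 36.69 | 35.42 | 35.98 | 35.97 | 35.71 | 36.42 | 36.21 | 36.92 | 35.98 | 35.98 |
| 1 | WR | 37.23 | 37.24 | 37.23 | 37.23 | 36.99 | 37.23 | 37.23 | 37.23 | 37.47 | 35.42 | 37.23 | 37.24 | 37.23 | 37.47 | 37.24 | 37.23 | 37.23 | 37.48 |
| 2 | KM | 66.23 | 62.72 | 63.35 | 65.89 | 63.36 | 64.16 | 63.76 | 62.94 | 64.58 | 65.65 | 65.64 | 63.49 | 65.98 | 63.71 | 64.49 | 64.34 | 64.92 | 64.11 |
| 2 | SL | 36.20 | 36.20 | 36.20 | 36.20 | 35.96 | 36.20 | 36.20 | 35.96 | 36.20 | 36.20 | 36.20 | 36.20 | 36.20 | 36.20 | 36.20 | 36.20 | 36.20 | 35.96 |
| 2 | AL | 47.66 | 47.74 | 47.74 | 47.66 | 48.22 | 47.74 | 47.74 | 47.74 | 56.51 | 48.41 | 55.80 | 47.74 | 47.74 | 48.17 | 47.66 | 47.74 | 47.74 | 47.74 |
| 2 | WR | 68.76 | 68.34 | 68.82 | 68.74 | 68.31 | 68.82 | 68.82 | 69.27 | 68.86 | 68.79 | 69.09 | 68.83 | 68.82 | 68.74 | 68.81 | 68.35 | 68.82 | 68.30 |
| 3 | KM | 53.57 | 56.62 | 57.63 | 56.97 | 55.02 | 56.90 | 54.66 | 55.98 | 55.76 | 52.26 | 52.79 | 49.30 | 47.12 | 53.54 | 56.55 | 54.39 | 57.22 | 57.84 |
| 3 | SL | 37.19 | 37.33 | 37.15 | 37.33 | 37.15 | 37.15 | 37.15 | 37.33 | 45.27 | 37.19 | 45.27 | 37.33 | 37.15 | 37.19 | 37.33 | 37.33 | 37.15 | 37.19 |
| 3 | AL | 66.74 | 66.64 | 68.11 | 66.65 | 66.74 | 68.11 | 68.11 | 66.75 | 68.18 | 66.45 | 67.89 | 66.42 | 68.11 | 66.74 | 66.69 | 68.11 | 68.11 | 66.69 |
| 3 | WR | 58.68 | 58.45 | 58.43 | 58.44 | 58.43 | 58.43 | 58.43 | 55.56 | 57.21 | 58.44 | 57.20 | 58.47 | 58.43 | 58.43 | 58.45 | 58.44 | 58.43 | 58.44 |
| 4 | KM | 55.42 | 53.58 | 66.64 | 56.31 | 61.12 | 60.75 | 53.58 | 55.45 | 58.19 | 64.08 | 58.95 | 56.55 | 58.61 | 58.56 | 56.48 | 57.67 | 50.88 | 61.02 |
| 4 | SL | 48.47 | 48.22 | 57.33 | 48.22 | 49.43 | 57.33 | 57.33 | 47.98 | 56.81 | 48.93 | 44.83 | 48.46 | 48.45 | 48.47 | 48.47 | 56.59 | 37.43 | 48.47 |
| 4 | AL | 69.45 | 69.38 | 69.39 | 69.44 | 69.13 | 69.39 | 69.39 | 69.45 | 69.42 | 69.42 | 69.44 | 69.36 | 69.41 | 69.43 | 69.42 | 69.38 | 69.41 | 69.42 |
| 4 | WR | 57.10 | 57.44 | 56.96 | 57.38 | 56.97 | 56.96 | 56.96 | 57.06 | 55.96 | 57.33 | 56.95 | 61.03 | 57.61 | 57.20 | 56.91 | 60.76 | 57.60 | 57.20 |
| 5 | KM | 48.57 | 52.97 | 61.71 | 55.73 | 53.55 | 57.68 | 52.74 | 52.81 | 55.03 | 58.94 | 58.52 | 59.88 | 49.14 | 51.89 | 55.27 | 48.53 | 54.52 | 55.45 |
| 5 | SL | 48.11 | 48.08 | 48.30 | 48.30 | 50.47 | 48.30 | 48.30 | 50.40 | 48.30 | 48.25 | 47.98 | 50.23 | 48.30 | 48.11 | 48.30 | 48.33 | 48.30 | 48.05 |
| 5 | AL | 68.65 | 66.97 | 66.97 | 66.97 | 66.99 | 65.07 | 65.07 | 66.97 | 67.13 | 66.97 | 64.98 | 66.98 | 65.07 | 66.97 | 66.97 | 66.97 | 65.07 | 66.96 |
| 5 | WR | 58.12 | 57.40 | 59.97 | 57.41 | 56.57 | 55.47 | 55.47 | 53.99 | 53.85 | 55.17 | 58.91 | 57.89 | 55.47 | 57.40 | 58.18 | 55.42 | 55.47 | 58.15 |
| 6 | KM | 55.05 | 62.66 | 66.24 | 60.07 | 59.45 | 62.47 | 63.64 | 56.93 | 50.90 | 57.44 | 56.33 | 57.44 | 66.06 | 53.05 | 62.90 | 62.62 | 61.15 | 56.33 |
| 6 | SL | 35.94 | 35.94 | 35.94 | 36.20 | 35.94 | 35.94 | 35.94 | 35.94 | 35.95 | 36.20 | 35.95 | 35.95 | 35.94 | 35.95 | 35.94 | 35.95 | 35.94 | 35.94 |
| 6 | AL | 36.71 | 37.73 | 37.71 | 67.47 | 36.76 | 37.71 | 37.71 | 37.19 | 68.65 | 68.66 | 68.67 | 36.47 | 37.73 | 36.72 | 37.67 | 36.70 | 37.73 | 36.73 |
| 6 | WR | 58.80 | 74.69 | 71.20 | 69.31 | 67.11 | 68.96 | 68.96 | 66.99 | 61.60 | 72.63 | 59.81 | 69.74 | 68.35 | 67.06 | 68.24 | 67.90 | 68.35 | 67.49 |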

Comparing the $C_i$ results for the Std Yeast data set (Table 3), we can see that both versions of the WEACS approach perform better than EAC. JWEACS obtained 74.69% and SWEACS 72.63% in the best result over all cluster validity indices, versus 69.45% for EAC. Analyzing the experimental results in the nine data sets, we can see that none of the ensemble combination approaches systematically produces the best results in all situations. However, on average, the SWEACS and JWEACS approaches produce better results when compared with EAC. The JWEACS and SWEACS results for each cluster validity index are in many situations equal to the EAC results; in other situations the EAC results are improved by the SWEACS and JWEACS approaches; and in fewer situations the EAC results are better than those of SWEACS and JWEACS.

By examining the clustering ensemble construction methods, we can observe that in 6 of the 9 data sets used, the partitions of the ALL clustering ensemble construction method provide the best results in the EAC, JWEACS and SWEACS methods. Therefore, we can say that joining all the partitions produced by all the clustering ensemble construction methods is a good choice to construct cluster ensembles for these approaches.

We also obtained results of the single application of each clustering algorithm (SL, CL, AL, KM and CLR) to each data set. Table 4 presents the best individual results produced by each clustering method (lines SL to KM) and the best combined results per combination strategy (lines Strehl to JWEACS) over 50 runs. In the Strehl line of the table we present the best $C_i$ result of the 3 Strehl & Ghosh heuristics (HGPA, MCLA and CSPA).

Table 4. $C_i$ results in SL, CL, AL algorithms and $C_i$ best results in CLR, KM, Strehl, EAC and WEACS approaches over 50 runs.

| | Spiral | Log Yeast | Std Yeast | Optical | Cigar | Breast | Iris | Half Rings | Bars |
|---|---|---|---|---|---|---|---|---|---|
| SL | 100 | 34.9 | 36.2 | 10.6 | 60.4 | 65.15 | 68 | 95 | 50.25 |
| CL | 52 | 28.91 | 66.67 | 51.8 | 55.6 | 92.83 | 84 | 72 | 98.75 |
| AL | 52 | 28.65 | 65.89 | 75.7 | 87.2 | 94.29 | 90.67 | 73.4 | 98.75 |
| CLR | 64.5 | 38.28 | 71.61 | 79.4 | 98 | 96.34 | 93.33 | 81.2 | 98.75 |
| KM | 64.5 | 35.94 | 71.09 | 77.5 | 74.8 | 96.49 | 91.33 | 77.6 | 99.5 |
| Strehl | 100 | 37.94 | 65.57 | 84.98 | 72.81 | 96.48 | 98 | 95.05 | 99.5 |
| EAC | 100 | 40.93 | 69.45 | 82.73 | 100 | 97.07 | 93.95 | 100 | 99.5 |
| SWEACS | 100 | 41.58 | 72.63 | 84.31 | 100 | 97.2 | 93.33 | 100 | 99.5 |
| JWEACS | 100 | 41.51 | 74.69 | 82.39 | 100 | 97.07 | 93.33 | 100 | 99.5 |

In almost all data sets the WEACS results outperform the single application of all the clustering algorithms. In the Log Yeast and Std Yeast data sets, we can see the superiority of the SWEACS and JWEACS approaches. In the Cigar and Half Rings data sets, both the EAC and WEACS approaches obtain 100%, which is much better than the results obtained by the other algorithms. The SWEACS approach obtained a better best result than the EAC approach in 4 data sets, than the best of the Strehl heuristics in 5 data sets and than the JWEACS version in 3 data sets. On the other hand, the EAC approach obtained a better best result than the SWEACS approach in only 1 data set, than the JWEACS approach in 2 data sets and than the best of the Strehl heuristics in 5 data sets. The Strehl heuristics obtained a better best result than the EAC approach in 2 data sets, than the SWEACS approach in 2 data sets and than the JWEACS approach in 2 data sets. The JWEACS approach obtained a better best result than the EAC approach in 2 data sets, than the best of the Strehl heuristics in 5 data sets and than the SWEACS approach in 1 data set. The average percentage of improvement in the best results of SWEACS as compared to EAC, over all data sets, was 0.55%, while that of JWEACS as compared to EAC was 0.54%. The average percentage of improvement in the best results of SWEACS as compared to the Strehl heuristics, over all data sets, was 4.25%, while that of JWEACS as compared to the Strehl heuristics was 4.24%.
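The pairwise comparisons and average improvements quoted above can be recomputed from the values in Table 4. The sketch below assumes that the "average percentage of improvement" is the mean difference in percentage points over the nine data sets; under that assumption it gives the reported figures after rounding to two decimals. The helper names are illustrative.

```python
DATA_SETS = ["Spiral", "Log Yeast", "Std Yeast", "Optical", "Cigar",
             "Breast", "Iris", "Half Rings", "Bars"]

# Best Ci values (%) copied from Table 4, in the order of DATA_SETS.
BEST_CI = {
    "Strehl": [100, 37.94, 65.57, 84.98, 72.81, 96.48, 98, 95.05, 99.5],
    "EAC":    [100, 40.93, 69.45, 82.73, 100, 97.07, 93.95, 100, 99.5],
    "SWEACS": [100, 41.58, 72.63, 84.31, 100, 97.2, 93.33, 100, 99.5],
    "JWEACS": [100, 41.51, 74.69, 82.39, 100, 97.07, 93.33, 100, 99.5],
}

def mean_gain(method, baseline):
    """Mean difference (percentage points) between two rows of Table 4."""
    diffs = [m - b for m, b in zip(BEST_CI[method], BEST_CI[baseline])]
    return sum(diffs) / len(diffs)

def wins(method, baseline):
    """Number of data sets where `method` has a strictly better best result."""
    return sum(m > b for m, b in zip(BEST_CI[method], BEST_CI[baseline]))

for method in ("SWEACS", "JWEACS"):
    for baseline in ("EAC", "Strehl"):
        print(method, "vs", baseline,
              "wins:", wins(method, baseline),
              "mean gain:", round(mean_gain(method, baseline), 2))
```

Most of the gain over the Strehl heuristics comes from a single data set (Cigar, 100 versus 72.81), which is worth keeping in mind when reading the averages.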

Table 5 shows the average $C_i$ results of the CLR and KM algorithms and of the clustering ensemble combination approaches over 50 runs. In the Strehl line of the table we present the average $C_i$ result of the 3 Strehl & Ghosh heuristics (HGPA, MCLA and CSPA). We can see that none of the methods obtains the best average $C_i$ result in all data sets. The CLR and KM algorithms and the EAC and Strehl & Ghosh approaches each obtain two best average $C_i$ results, and the JWEACS approach obtains one best average $C_i$ result.

Table 5. Average $C_i$ results of CLR, KM, Strehl, EAC and WEACS approaches over 50 runs.

| | Spiral | Log Yeast | Std Yeast | Optical | Cigar | Breast | Iris | Half Rings | Bars |
|---|---|---|---|---|---|---|---|---|---|
| CLR | 57.40 | 31.58 | 62.53 | 73.80 | 71.20 | 95.61 | 89.37 | 76.55 | 97.06 |
| KM | 57.85 | 30.98 | 60.93 | 68.01 | 61.48 | 96.33 | 78.18 | 71.92 | 97.46 |
| Strehl | 68.37 | 32.25 | 51.59 | 69.44 | 68.11 | 80.03 | 93.72 | 93.04 | 96.04 |
| EAC | 71.78 | 34.78 | 50.68 | 58.55 | 83.23 | 77.32 | 70.38 | 84.98 | 83.19 |
| SWEACS | 70.97 | 34.37 | 51.96 | 57.23 | 82.13 | 77.55 | 71.84 | 83.24 | 80.71 |
| JWEACS | 70.83 | 34.62 | 51.67 | 57.60 | 83.63 | 77.37 | 70.76 | 84.86 | 83.24 |

Table 6 presents the $C_i$ results of all the final data partitions obtained after the application of the complementary step to the WEACS approach.

Table 6. $C_i$ results of the final data partitions obtained after the application of the complementary step to the WEACS approach.

| Ensemble | Extraction | Spiral | Log Yeast | Std Yeast | Optical | Cigar | Breast | Iris | Half Rings | Bars |
|---|---|---|---|---|---|---|---|---|---|---|
| SL | KM | 100.00 | 24.76 | 30.43 | 35.47 | 73.90 | 69.38 | 71.25 | 100.00 | 95.74 |
| SL | SL | 100.00 | 34.94 | 35.95 | 11.60 | 94.98 | 65.15 | 65.36 | 100.00 | 95.75 |
| SL | AL | 100.00 | 34.90 | 36.74 | 20.20 | 94.47 | 68.25 | 71.25 | 100.00 | 95.75 |
| SL | WR | 100.00 | 28.29 | 31.99 | 43.76 | 95.01 | 68.33 | 71.25 | 100.00 | 95.75 |
| AL | KM | 50.51 | 33.99 | 65.88 | 73.26 | 80.37 | 96.78 | 77.27 | 100.00 | 64.25 |
| AL | SL | 99.22 | 35.42 | 43.91 | 67.70 | 97.66 | 65.15 | 69.10 | 60.32 | 64.25 |
| AL | AL | 50.98 | 35.42 | 63.32 | 67.29 | 98.20 | 65.15 | 69.10 | 99.60 | 64.25 |
| AL | WR | 50.74 | 31.79 | 68.30 | 84.29 | 98.07 | 96.78 | 78.27 | 95.00 | 64.25 |
| CL | KM | 53.46 | 33.86 | 57.77 | 64.73 | 89.77 | 96.76 | 88.08 | 95.00 | 99.50 |
| CL | SL | 96.40 | 30.21 | 67.86 | 60.42 | 99.58 | 95.35 | 74.67 | 95.00 | 67.78 |
| CL | AL | 50.94 | 29.82 | 58.89 | 72.19 | 99.58 | 96.61 | 74.67 | 95.00 | 99.50 |
| CL | WR | 51.20 | 34.78 | 57.86 | 72.90 | 99.98 | 96.61 | 74.80 | 95.00 | 99.50 |
| KM | KM | 51.03 | 36.97 | 56.82 | 72.82 | 63.79 | 67.94 | 88.12 | 88.18 | 98.67 |
| KM | SL | 68.49 | 40.89 | 62.65 | 57.26 | 70.80 | 64.57 | 89.40 | 72.41 | 98.67 |
| KM | AL | 51.77 | 40.89 | 69.42 | 79.49 | 70.80 | 67.92 | 89.53 | 82.91 | 98.67 |
| KM | WR | 51.81 | 40.89 | 55.16 | 78.08 | 70.80 | 67.94 | 89.53 | 99.23 | 98.67 |
| CLR | KM | 98.27 | 35.26 | 54.44 | 64.65 | 71.32 | 90.58 | 80.39 | 92.59 | 98.75 |
| CLR | SL | 97.82 | 36.39 | 57.14 | 39.19 | 100.00 | 69.75 | 52.00 | 99.80 | 96.68 |
| CLR | AL | 79.96 | 34.81 | 67.10 | 78.55 | 100.00 | 69.65 | 52.00 | 93.64 | 98.75 |
| CLR | WR | 82.89 | 35.34 | 53.97 | 77.33 | 100.00 | 69.65 | 52.00 | 93.64 | 98.75 |
| ALL | KM | 100.00 | 33.22 | 64.92 | 68.06 | 77.29 | 97.05 | 68.67 | 99.20 | 98.83 |
| ALL | SL | 100.00 | 35.42 | 40.68 | 49.52 | 100.00 | 65.15 | 69.33 | 99.90 | 99.50 |
| ALL | AL | 100.00 | 31.29 | 60.48 | 66.59 | 100.00 | 97.05 | 69.33 | 99.90 | 99.42 |
| ALL | WR | 100.00 | 33.16 | 69.80 | 80.78 | 100.00 | 97.05 | 94.00 | 99.90 | 99.42 |

The last line of the table shows the result of combining the final data partitions obtained by the WEACS approach with the ALL clustering ensemble construction method and then applying the Ward Link algorithm (ALL+WR) to obtain the final data partition. In 5 of the 9 data sets (Spiral, Std Yeast, Cigar, Breast Cancer and Iris) this combination achieves the best Ci results, and in 2 other data sets (Half Rings and Bars) the results are very close to the best: 99.90% against a maximum of 100% for Half Rings, and 99.42% against a maximum of 99.50% for Bars. In the Optical data set the result is 80.78%, below the maximum of 84.29% obtained by another combination. However, this result is close to the maximum obtained by the EAC approach (82.73%) and much higher than the average value obtained by the EAC approach (58.55%) and by all other clustering ensemble combination approaches. In the Log Yeast data set the result is 33.16%, below the maximum of 40.89% obtained by another combination. This result is also below the maximum obtained by the EAC approach (40.93%) and slightly below the average values obtained by the EAC approach (34.78%) and by both versions of WEACS (34.37% and 34.62%). However, it is higher than the average values obtained by the CL (28.91%), AL (28.65%), CLR (31.58%), KM (30.98%) and Strehl (32.25%) methods.
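To make the complementary step more concrete, the following is a minimal Python sketch, not the authors' implementation. It builds an unweighted co-association matrix from a few final partitions (the EAC voting step) and then merges clusters with a Ward-style Lance-Williams update applied to the dissimilarity 1 - co_assoc. The function names `co_assoc` and `ward_link`, the toy partitions, and the choice of dissimilarity are assumptions made for this example; the text does not specify how Ward Link is applied to the matrix.

```python
from itertools import combinations
from math import sqrt


def co_assoc(partitions):
    """Fraction of partitions that place each pair of samples in the same cluster."""
    n = len(partitions[0])
    return [[sum(p[i] == p[j] for p in partitions) / len(partitions)
             for j in range(n)] for i in range(n)]


def ward_link(dist, k):
    """Merge clusters until k remain, using the Lance-Williams update for Ward."""
    clusters = {i: [i] for i in range(len(dist))}
    d = {frozenset((i, j)): dist[i][j] for i, j in combinations(clusters, 2)}
    next_id = len(dist)
    while len(clusters) > k:
        pair = min(d, key=d.get)          # closest pair of clusters
        a, b = pair
        na, nb = len(clusters[a]), len(clusters[b])
        dab = d.pop(pair)
        updated = {}
        for c in clusters:
            if c in (a, b):
                continue
            nc = len(clusters[c])
            dac = d.pop(frozenset((a, c)))
            dbc = d.pop(frozenset((b, c)))
            updated[c] = sqrt(((na + nc) * dac ** 2 + (nb + nc) * dbc ** 2
                               - nc * dab ** 2) / (na + nb + nc))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        for c, value in updated.items():
            d[frozenset((next_id, c))] = value
        next_id += 1
    return sorted(sorted(members) for members in clusters.values())


# Four hypothetical final partitions of six samples (labels are arbitrary).
final_partitions = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],   # disagrees about sample 2
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],   # same grouping with permuted labels
]
C = co_assoc(final_partitions)
D = [[1.0 - c for c in row] for row in C]
combined = ward_link(D, k=2)
print(combined)  # [[0, 1, 2], [3, 4, 5]]
```

On this toy input the sketch recovers the groups {0, 1, 2} and {3, 4, 5}, even though one partition disagrees about sample 2 and another uses permuted labels, because the co-association votes depend only on which samples share a cluster.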

Table 7 presents the percentage difference (the improvement in accuracy) between the performance of the WEACS approach with the complementary step (ALL+WR) and the average values obtained with the single application of the algorithms and with the EAC, WEACS and Strehl approaches in each data set. The last column shows the average improvement, over all data sets, of the WEACS approach with the complementary step (ALL+WR) relative to each single algorithm and each clustering ensemble combination approach. For every approach this average improvement exceeds 10%, which suggests that the complementary step is robust and can be followed to obtain good clusterings. It can also be seen that in all data sets except Log Yeast, the values obtained by the WEACS approach with the complementary step (ALL+WR) are at least as good as the average of every other approach.

Table 7. Percentage difference (improvement) between the performance of the WEACS approach with the complementary step (ALL+WR) and the average values obtained with the single application of the algorithms, Strehl, EAC and WEACS approaches in each data set.

          Spiral  Log Yeast  Std Yeast  Optical  Cigar  Breast   Iris  Half Rings   Bars  Improve
SL          0.00      -1.74      33.60    70.18  39.60   31.90  26.00        4.90  49.17    28.18
CL         48.00       4.25       3.13    28.98  44.40    4.22  10.00       27.90   0.67    19.06
AL         48.00       4.51       3.91     5.08  12.80    2.76   3.33       26.50   0.67    11.95
CLR        42.60       1.58       7.27     6.98  28.80    1.44   4.63       23.35   2.36    13.22
KM         42.15       2.18       8.87    12.77  38.52    0.72  15.82       27.98   1.96    16.77
Strehl     31.63       0.91      18.21    11.34  31.89   17.02   0.28        6.86   3.38    13.50
EAC        28.22      -1.62      19.12    22.23  16.77   19.73  23.62       14.92  16.23    17.69
SWEACS     29.03      -1.21      17.84    23.55  17.87   19.50  22.16       16.66  18.71    18.23
JWEACS     29.17      -1.46      18.13    23.18  16.37   19.68  23.24       15.04  16.18    17.72
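The arithmetic behind Table 7 can be checked with a short Python sketch. It assumes that each entry is the plain difference, in Ci percentage points, between the ALL+WR result and the average result of the named approach, and that the Improve column is the mean of the nine per-data-set differences. The helper names `average_improvement` and `implied_average` are hypothetical; the numbers are copied from the SL and EAC rows of Table 7 and from the ALL+WR line of the previous table.

```python
# Column order follows the Table 7 header.
datasets = ["Spiral", "Log Yeast", "Std Yeast", "Optical", "Cigar",
            "Breast", "Iris", "Half Rings", "Bars"]

# Two rows of Table 7 (differences in percentage points).
diff = {
    "SL":  [0.00, -1.74, 33.60, 70.18, 39.60, 31.90, 26.00, 4.90, 49.17],
    "EAC": [28.22, -1.62, 19.12, 22.23, 16.77, 19.73, 23.62, 14.92, 16.23],
}

# ALL+WR results (last line of the previous table).
all_wr = [100.00, 33.16, 69.80, 80.78, 100.00, 97.05, 94.00, 99.90, 99.42]


def average_improvement(row):
    """Mean difference over all data sets (the Improve column)."""
    return round(sum(row) / len(row), 2)


def implied_average(method, dataset):
    """Average Ci of `method` recovered as ALL+WR result minus the difference."""
    k = datasets.index(dataset)
    return round(all_wr[k] - diff[method][k], 2)


print(average_improvement(diff["SL"]))      # 28.18
print(average_improvement(diff["EAC"]))     # 17.69
print(implied_average("EAC", "Optical"))    # 58.55
print(implied_average("EAC", "Log Yeast"))  # 34.78
```

The recovered EAC averages for Optical (58.55%) and Log Yeast (34.78%) match the values quoted in the discussion of the ALL+WR results, which supports reading the table entries as differences in percentage points.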

5 Conclusions

In this paper we present the WEACS approach, which exploits subsampling to increase the diversity of the clustering ensembles and extends the idea of EAC by weighting multiple clusterings with internal and relative validity indices. Partitions in the clustering ensembles are produced by clustering subsamples of the data set using the K-means, Clarans, SL, CL and AL algorithms. We build the clustering ensembles in two different ways: using only the partitions generated by a single algorithm with different initializations and/or parameter values; and using partitions generated by different clustering algorithms with different initializations and/or parameter values. Using a voting mechanism, the partitions of the clustering ensembles are weighted in the SWEACS version by a single internal or relative index and incorporated in a w_co_assoc matrix; in the JWEACS version all internal and relative indices contribute to the weight of each partition. The combined data partition is obtained by clustering the w_co_assoc matrix using the K-means, SL, CL, AL or WR algorithms. Experimental results with both synthetic and real data show that the best results of SWEACS are, in general, better than those of the EAC and Strehl methods. However, no individual WEACS configuration leads systematically to the best results in all data sets. As a complementary step to the WEACS approach, we combine all the final data partitions obtained with the ALL clustering ensemble construction method. We use the EAC approach to perform this combination and the Ward Link algorithm to obtain the final data partition. In almost all data sets this reaches the best results or values very close to the best results.

These results show that the association of the subsampling and weighting mechanisms with cluster combination techniques leads to good results.
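The weighted voting summarized above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `weighted_co_assoc`, the toy subsamples, and the weights are hypothetical, and the normalization (dividing each accumulated vote by the number of subsamples in which both samples were drawn) is assumed here as one reasonable way to handle subsampling without replacement.

```python
def weighted_co_assoc(n_samples, partitions, weights):
    """Accumulate weighted votes for pairs of samples that share a cluster.

    partitions: list of (indices, labels) pairs, one per subsample, where
        labels[a] is the cluster of sample indices[a] in that subsample.
    weights: one validity-based weight per partition (assumed in [0, 1]).
    """
    votes = [[0.0] * n_samples for _ in range(n_samples)]
    seen = [[0] * n_samples for _ in range(n_samples)]
    for (indices, labels), w in zip(partitions, weights):
        for a, i in enumerate(indices):
            for b, j in enumerate(indices):
                seen[i][j] += 1            # both samples drawn together
                if labels[a] == labels[b]:
                    votes[i][j] += w       # weighted vote for "same cluster"
    # Normalize by co-occurrence so rarely sampled pairs are not penalized.
    return [[votes[i][j] / seen[i][j] if seen[i][j] else 0.0
             for j in range(n_samples)] for i in range(n_samples)]


# Three hypothetical subsample partitions of a 4-sample data set.
partitions = [
    ([0, 1, 2], [0, 0, 1]),
    ([1, 2, 3], [0, 1, 1]),
    ([0, 1, 3], [0, 0, 1]),
]
weights = [1.0, 0.5, 0.5]   # e.g. normalized validity index values

w_co_assoc = weighted_co_assoc(4, partitions, weights)
print(w_co_assoc[0][1])  # 0.75
print(w_co_assoc[2][3])  # 0.5
```

In a SWEACS-like setting each weight would come from a single validity index, while in a JWEACS-like setting it would be an aggregate over several indices; either way the resulting matrix is then clustered to obtain the combined partition.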

References

1. A. Fred, “Finding consistent clusters in data partitions,” in Multiple Classifier Systems, Josef Kittler and Fabio Roli, editors, LNCS vol. 2096, Springer, 2001, pp. 309-318.

2. A. Fred and A.K. Jain, “Evidence accumulation clustering based on the k-means algorithm,” in S.S.S.P.R, T. Caelli et al., editors, LNCS vol. 2396, Springer-Verlag, 2002, pp. 442-451.

3. A. Fred and A.K. Jain, “Combining Multiple Clusterings using Evidence Accumulation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 6, June 2005, pp. 835-850.

4. F. Jorge Duarte, Ana L.N. Fred, André Lourenço and M. Fátima C. Rodrigues, “Weighting Cluster Ensembles in Evidence Accumulation Clustering,” Workshop on Extraction of Knowledge from Databases and Warehouses, EPIA 2005.

5. F. Jorge F. Duarte, Ana L.N. Fred, André Lourenço and M. Fátima C. Rodrigues, “Weighted Evidence Accumulation Clustering,” Fourth Australasian Conference on Knowledge Discovery and Data Mining, 2005.

6. A. Strehl and J. Ghosh, “Cluster ensembles - a knowledge reuse framework for combining multiple partitions,” Journal of Machine Learning Research 3, 2002.

7. B. Park and H. Kargupta, Data Mining Handbook, chapter: Distributed Data Mining, Lawrence Erlbaum Associates, 2003.

8. M. Meila and D. Heckerman, “An Experimental Comparison of Several Clustering and Initialization Methods,” Proc. 14th Conf. Uncertainty in Artificial Intelligence, 1998, pp. 386-395.

9. M. Halkidi, Y. Batistakis, M. Vazirgiannis, “Clustering algorithms and validity measures,” tutorial paper in the Proceedings of the SSDBM 2001 Conference.

10. S. Theodoridis, K. Koutroumbas, Pattern Recognition, Academic Press, 1999.

11. L.J. Hubert, J. Schultz, “Quadratic assignment as a general data-analysis strategy,” British Journal of Mathematical and Statistical Psychology, Vol. 29, 1975, pp. 190-241.

12. J.C. Dunn, “Well separated clusters and optimal fuzzy partitions,” J. Cybern., Vol. 4, 1974, pp. 95-104.

13. D.L. Davies, D.W. Bouldin, “A cluster separation measure,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 1, No. 2, 1979.

14. S.C. Sharma, Applied Multivariate Techniques, John Wiley & Sons, 1996.

15. T. Calinski, J. Harabasz, “A dendrite method for cluster analysis,” Communications in Statistics 3, 1974, pp. 1-27.

16. L. Kaufman, P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, New York, 1990.


17. U. Maulik and S. Bandyopadhyay, “Performance Evaluation of Some Clustering Algorithms and Validity Indices,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 12, 2002, pp. 1650-1654.

18. X.L. Xie, G. Beni, “A Validity Measure for Fuzzy Clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 13, 1991, pp. 841-847.

19. W. Krzanowski, Y. Lai, “A criterion for determining the number of groups in a dataset using sum of squares clustering,” Biometrics, 1985, pp. 23-34.

20. J.A. Hartigan, “Statistical theory in clustering,” J. Classification, 1985, pp. 63-76.

21. C.H. Chou, M.C. Su, E. Lai, “A new cluster validity measure and its application to image compression,” Pattern Analysis and Applications, Vol. 7, 2004, pp. 205-220.

22. S.T. Hadjitodorov, L.I. Kuncheva, L.P. Todorova, “Moderate Diversity for Better Cluster Ensembles,” Information Fusion, 2005, accepted.

23. X.Z. Fern, C.E. Brodley, “Random projection for high dimensional data clustering: a cluster ensemble approach,” 20th International Conference on Machine Learning, ICML, Washington, DC, 2003, pp. 186-193.

24. S. Monti, P. Tamayo, J. Mesirov, T. Golub, “Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data,” Machine Learning, Vol. 52, 2003, pp. 91-118.

25. A. Topchy, B. Minaei-Bidgoli, A.K. Jain, W. Punch, “Adaptive Clustering Ensembles,” Proc. Intl. Conf. on Pattern Recognition, ICPR’04, Cambridge, UK, 2004, pp. 272-275.

26. B. Minaei-Bidgoli, A. Topchy, W. Punch, “Ensembles of Partitions via Data Resampling,” Proc. IEEE Intl. Conf. on Information Technology: Coding and Computing, ITCC04, Vol. 2, April 2004, pp. 188-192.

27. E. Dimitriadou, A. Weingessel, K. Hornik, “Voting-Merging: An Ensemble Method for Clustering,” Artificial Neural Networks – ICANN, August 2001.

28. A. Lourenço, A. Fred, “Comparison of Combination Methods using Spectral Clustering Ensembles,” in Proc. Pattern Recognition on Information Systems, 2004.
