A COMPARATIVE EVALUATION OF PROXIMITY MEASURES FOR
SPECTRAL CLUSTERING
Nadia Farhanaz Azam and Herna L. Viktor
School of Electrical Engineering and Computer Science, University of Ottawa
800 King Edward Avenue, Ottawa, Ontario K1N6N5, Canada
Keywords:
Spectral clustering, Proximity measures, Similarity measures, Boundary detection.
Abstract:
A cluster analysis algorithm is considered successful when the data is clustered into meaningful groups so that
the objects in the same group are similar, and the objects residing in two different groups are different from
one another. One such cluster analysis algorithm, the spectral clustering algorithm, has been deployed across
numerous domains ranging from image processing to clustering protein sequences with a wide range of data
types. The input, in this case, is a similarity matrix, constructed from the pair-wise similarity between the data
objects. The pair-wise similarity between the objects is calculated by employing a proximity (similarity, dis-
similarity or distance) measure. It follows that the success of a spectral clustering algorithm heavily
depends on the selection of the proximity measure. While the majority of prior research on the spectral clus-
tering algorithm emphasizes algorithm-specific issues, little research has been performed on the evaluation
of the performance of the proximity measures. To this end, we perform a comparative and exploratory anal-
ysis on several existing proximity measures to evaluate their suitability for the spectral clustering algorithm.
Our results indicate that the commonly used Euclidean distance measure may not always be a good choice,
especially in domains where the data is highly imbalanced and the correct clustering of the boundary objects
is crucial. Furthermore, for numeric data, measures based on relative distances often yield better results
than measures based on absolute distances, specifically when aiming to cluster boundary objects. When
considering mixed data, the measure for numeric data has the highest impact on the final outcome and, again,
the use of the Euclidean measure may be inappropriate.
1 INTRODUCTION
In recent years, a new family of cluster analysis algo-
rithms, collectively known as the Spectral Clustering
algorithms, has gained much interest in the research
community. One of the main strengths of the spectral
clustering algorithm is that it may be applied to a wide
range of data types (i.e. numeric, categorical, binary,
and mixed), as it is not restricted to any particular data
type. These algorithms consider
the pair-wise similarity between the data objects to
construct a similarity (also known as proximity, affin-
ity, or weight) matrix. The eigenvectors and eigen-
values of this similarity matrix are then used to find
the clusters ((Luxburg, 2007), (Shi and Malik, 2000),
(Ng et al., 2001)). The various algorithms from this
family mainly differ with respect to how the similarity
matrix is manipulated and/or which eigenvalue(s) and
eigenvector(s) are used to partition the objects into
disjoint clusters. Significant theoretical progress has
been made on improving the spectral clustering algorithms,
proposing new methods, and applying them in various domains
such as image segmentation and speech separation. How-
ever, little research has been performed on the selec-
tion of proximity measures, which is a crucial step in
constructing the similarity matrix. In this paper, we
evaluate the performance of a number of such prox-
imity measures and perform an explorative study on
their behavior when applied to the spectral clustering
algorithms.
Proximity measures, i.e. similarity, dissimilar-
ity and distance measures, often play a fundamen-
tal role in cluster analysis (Jain et al., 1999). Early
steps of the majority of cluster analysis algorithms of-
ten require the selection of a proximity measure and
the construction of a similarity matrix (if necessary).
Most of the time, the similarity matrix is constructed
from an existing similarity or distance measure, or by
introducing a new measure specifically suitable for
a particular domain or task. It follows that the se-
lection of such measures, particularly when existing
measures are applied, requires careful consideration
as the success of these algorithms relies heavily on the
choice of the proximity function ((Luxburg, 2007),
(Bach and Jordan, 2003), (Everitt, 1980)).
Most of the previous studies on the spectral clus-
tering algorithm use the Euclidean distance mea-
sure, a distance measure based on linear differences,
to construct the similarity matrix for the numeric fea-
ture type ((Shi and Malik, 2000), (Ng et al., 2001),
(Verma and Meila, 2001)) without explicitly stating
the consequences of selecting the distance measure.
However, there are several different proximity mea-
sures available for numeric variable types. Each of
them has its own strengths and weaknesses. To
our knowledge, no in-depth evaluation of the performance
of these proximity measures on spectral clustering algorithms,
specifically one showing that the Euclidean distance measure
outperforms the others, has been carried out. As such, an evaluation and an exploratory
study that compares and analyzes the performance of
various proximity measures may provide important
guidelines for researchers when selecting a
proximity measure for future studies in this area. This
paper endeavors to evaluate and compare the perfor-
mance of these measures and to identify the conditions
under which they may be expected to perform well.
This paper is organized as follows. In Section 2,
we discuss the two spectral clustering algorithms that
we used in our experiment. Section 3 presents an
overview of several proximity measures for numeric,
and mixed variable types. This is followed by Section
4, where we present our experimental approach and
evaluate and analyze the results obtained from our ex-
periments. We conclude the paper in Section 5.
2 SPECTRAL CLUSTERING
Spectral clustering algorithms originated from the
area of graph partitioning and manipulate the eigen-
value(s) and eigenvector(s) of the similarity matrix to
find the clusters. There are several advantages, when
compared to other cluster analysis methods, to ap-
plying the spectral clustering algorithms ((Luxburg,
2007), (Ng et al., 2001), (Aiello et al., 2007), (Fis-
cher and Poland, 2004)). Firstly, the algorithms do
not make assumptions about the shape of the clusters. As
such, while spectral clustering algorithms may be able
to find meaningful clusters with strongly coherent ob-
jects, algorithms such as K-means or K-medians may
fail to do so. Secondly, the algorithms do not suffer
from local minima. Therefore, it may not be neces-
sary to restart the algorithm with various initialization
options. Thirdly, the algorithms are also more stable
than some algorithms in terms of initializing the user-
specific parameters (i.e. the number of clusters). As
such, the user-specific parameters may often be esti-
mated accurately with the help of theories related to
the algorithms. Prior studies also show that the al-
gorithms from this group often outperform tra-
ditional clustering algorithms such as K-means and
Single Linkage (Luxburg, 2007). Importantly, the al-
gorithms from the spectral family are able to handle
different types of data (i.e. numeric, nominal, binary,
or mixed) and one only needs to convert the dataset
into a similarity matrix to be able to apply this algo-
rithm on a given dataset (Luxburg, 2007).
The spectral clustering algorithms are divided into
two types, namely recursive algorithms and multi-
way algorithms (Verma and Meila, 2001). In this
paper, we consider two algorithms, one from each
group. From the first group, we select the normal-
ized cut spectral clustering algorithm, as this algorithm
has had several practical successes in a va-
riety of fields (Shi and Malik, 2000). We refer to this
algorithm as SM (NCut) in the remainder of the pa-
per. The Ng, Jordan and Weiss algorithm is an im-
provement to the algorithm proposed by Meila and
Shi (Meila and Shi, 2001) and, therefore, we select this
algorithm (referred to as NJW (K-means)) from the sec-
ond group. In the following section we present several
algorithm-specific notations before we discuss the al-
gorithms themselves.
2.1 Notations
Similarity Matrix or Weight Matrix, W . Let W
be an N × N symmetric, non-negative matrix where
N is the number of objects in a given dataset. Let i
and j be any two objects in a given dataset, located
at row i and row j, respectively. If the similarity (i.e.
calculated from a proximity measure) between these
two objects is $w_{ij}$, then it is stored in the cell at
row i and column j of the weight matrix.
Degree Matrix, D. Let d be an N × 1 matrix with
entries $d_i = \sum_{j=1}^{n} w_{ij}$, which denote the total
similarity value from object i to the rest of the objects.
Therefore, the degree matrix D is an N × N diagonal
matrix which contains the elements of d on its main
diagonal.
Laplacian Matrix, L. The Laplacian matrix is con-
structed from the weight matrix W and the degree ma-
trix D. The main diagonal of this matrix is always
non-negative. In graph theory, the eigenvector(s) and
eigenvalue(s) of this matrix contain important infor-
mation about the underlying partitions present in the
graph. The spectral clustering algorithms also use
the same properties to find the clusters from a given
dataset.
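To make these notations concrete, the following sketch (in Python; our own implementation uses MATLAB and WEKA) builds W, D, and a normalized Laplacian from a numeric data matrix. The Euclidean distance, the Gaussian conversion of Equation 1 (Section 4.2), and the fixed value of sigma are illustrative assumptions only.

```python
import numpy as np
from scipy.spatial.distance import cdist

def build_matrices(X, sigma=1.0):
    """Construct the weight matrix W, the degree matrix D and a normalized
    Laplacian from a numeric data matrix X (one object per row)."""
    # Pairwise Euclidean distances (any proximity measure from Section 3 could be used).
    dist = cdist(X, X, metric="euclidean")
    # Convert distances to similarities with the Gaussian kernel of Equation 1.
    W = np.exp(-(dist ** 2) / (2.0 * sigma ** 2))
    # Degree of each object: total similarity from that object to all objects.
    d = W.sum(axis=1)
    D = np.diag(d)
    # Normalized Laplacian L = D^{-1/2} (D - W) D^{-1/2}.
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = D_inv_sqrt @ (D - W) @ D_inv_sqrt
    return W, D, L
```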
2.2 The SM (NCut) Algorithm
The SM (NCut) spectral clustering algorithm (Shi and
Malik, 2000) is one of the most widely used recur-
sive spectral clustering algorithms ((Luxburg, 2007),
(Verma and Meila, 2001)). The main intuition behind
this algorithm is the optimization of an objective func-
tion called the Normalized Cut, or NCut. Minimizing
the NCut function is the same as finding a cut such
that the total connection in between two groups is
weak, whereas the total connection within each group
is strong. The algorithm uses the eigenvector associ-
ated with the second smallest eigenvalue of the gen-
eralized eigenvalue system which is considered as the
real valued solution to the Normalized Cut problem.
The partitions are obtained by thresholding this eigen-
vector. There are a number of ways this grouping may
be performed. One may use a particular point (i.e.
zero, mean, median) as the splitting criteria or use an
existing algorithm such as the K-means or K-medians
algorithms for this purpose. Components with simi-
lar values usually reside in the same cluster. Since
this algorithm bi-partitions the data, we get two dis-
joint clusters. To find more clusters, we need to re-
partition the segments by recursively applying the al-
gorithm to each of the partitions.
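A minimal sketch of a single bipartition step is given below, assuming the zero or mean splitting points used in our experiments; the recursive re-partitioning and the K-means splitting option are omitted.

```python
import numpy as np
from scipy.linalg import eigh

def ncut_bipartition(W, split="zero"):
    """One SM (NCut) step: bipartition the objects with the eigenvector of the
    second smallest generalized eigenvalue of (D - W) v = lambda * D v."""
    d = W.sum(axis=1)
    D = np.diag(d)
    # Generalized symmetric eigenproblem; eigenvalues are returned in ascending order.
    eigvals, eigvecs = eigh(D - W, D)
    v = eigvecs[:, 1]                     # second smallest eigenvector
    threshold = 0.0 if split == "zero" else v.mean()
    return v >= threshold                 # boolean cluster membership
```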
2.3 The NJW (K-means) Algorithm
In contrast to the SM (NCut) algorithm that min-
imizes the NCut objective function and recursively
bi-partitions the data, this algorithm directly parti-
tions the data into k groups. The algorithm manip-
ulates the normalized Laplacian matrix to find the
clusters. The algorithm relates various theories from
the Random Walk Problem and Matrix Perturbation
Theory to theoretically motivate the partitioning solu-
tion ((Luxburg, 2007), (Ng et al., 2001), (Meila and
Shi, 2001)). Once the eigensystem is solved and the
k largest eigenvectors are normalized, the algorithm
uses the K-means algorithm to find the k partitions.
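The corresponding sketch for the NJW (K-means) algorithm follows; the use of scikit-learn's K-means and the normalization details are assumptions of this illustration rather than a transcript of our MATLAB implementation.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def njw_clustering(W, k, random_state=0):
    """NJW sketch: embed the objects with the k largest eigenvectors of the
    normalized matrix D^{-1/2} W D^{-1/2}, normalize the rows to unit length,
    and partition the embedded points with K-means."""
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    M = D_inv_sqrt @ W @ D_inv_sqrt
    _, eigvecs = eigh(M)                          # eigenvalues in ascending order
    U = eigvecs[:, -k:]                           # k largest eigenvectors
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    return KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(U)
```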
3 PROXIMITY MEASURES
Proximity measures quantify the distance or close-
ness between two data objects. They may be sub-
categorized into three types of measures, namely sim-
ilarity, dissimilarity, and distance.
Similarity is a numerical measure that represents
the similarity (i.e. how alike the objects are) between
two objects. This measure usually returns a non-
negative value that falls between 0 and 1. How-
ever, in some cases similarity may also range from −1
to +1. When the similarity takes a value of 0, it means
that there is no similarity between the objects and the
objects are very different from one another. In con-
trast, 1 denotes complete similarity, emphasizing that
the objects are identical and possess the same attribute
values.
The dissimilarity measure is also a numerical
measure, which represents the discrepancy or the dif-
ference between a pair of objects (Webb, 2002). If
two objects are very similar then the dissimilarity
measure will have a lower value, and vice versa.
Therefore, this measure is inversely related to the sim-
ilarity measure. The dissimilarity value also usually
falls into the interval [0, 1], but it may also take values
ranging from −1 to +1.
The term distance, which is also commonly used
as a synonym for the dissimilarity measure, com-
putes the distance between two data points in a multi-
dimensional space. Let d(x,y) be the distance be-
tween objects x and y. Then, the following four prop-
erties hold for a distance measure ((Larose, 2004),
(Han and Kamber, 2006)):
1. d(x,y) = d(y,x), for all points x and y.
2. d(x,y) = 0, if x = y.
3. d(x,y) ≥ 0, for all points x and y.
4. d(x,y) ≤ d(x,z) + d(z,y), for all points x, y and
z. This implies that introducing a third point
may never shorten the distance between two other
points.
There are many different proximity measures
available in the literature. One of the reasons for this
variety is that these measures differ according to the data type
of the objects in a given dataset. Next we present the
proximity measures that are used in this paper.
3.1 Proximity Measures for Numeric
Variables
Table 1 presents the measures for numeric, real-
valued or continuous variables used in our paper
((Webb, 2002), (Teknomo, 2007)). These measures
may be categorized into three groups. The first group
contains the functions that measure the absolute dis-
tance between the objects and are scale dependent.
This list includes the Euclidean (EUC), Manhattan
(MAN), Minkowski (MIN), and Chebyshev (CHEB)
distances. The second group contains only the Can-
Table 1: Proximity measures for numeric variables.

Euclidean Distance (EUC). Function: $d(x_i, x_j) = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2}$. Discussion: Works well for compact or isolated clusters; discovers clusters of spherical shape; the distance between any two objects may not be influenced by the addition of a new object (i.e. outliers); very sensitive to the scales of the variables; not suitable for clusters of different shapes; the variables with the largest values may always dominate the distance.

Manhattan Distance (MAN). Function: $d(x_i, x_j) = \sum_{k=1}^{n} |x_{ik} - x_{jk}|$. Discussion: Computationally cheaper than the Euclidean distance; scale dependent.

Minkowski Distance (MIN). Function: $d(x_i, x_j) = \left( \sum_{k=1}^{n} |x_{ik} - x_{jk}|^{\lambda} \right)^{1/\lambda}$. Discussion: One may control the amount of emphasis given to the larger differences; the Minkowski distance may cost more to compute than the Euclidean and Manhattan distances when λ > 2.

Chebyshev Distance (CHEB). Function: $d(x_i, x_j) = \max_{k} |x_{ik} - x_{jk}|$. Discussion: Suitable for situations where computation time is very crucial; very sensitive to the scale of the variables.

Canberra Distance (CAN). Function: $d(x_i, x_j) = \sum_{k=1}^{n} \frac{|x_{ik} - x_{jk}|}{|x_{ik} + x_{jk}|}$. Discussion: Not scale sensitive; suitable for non-negative values; very sensitive to changes near the origin; undefined when both coordinates are 0.

Mahalanobis Distance (MAH). Function: $d(\vec{x}, \vec{y}) = \sqrt{(\vec{x} - \vec{y})\, C^{-1} (\vec{x} - \vec{y})^{T}}$. Discussion: Considers the correlation between the variables; not scale dependent; favors clusters of hyper-ellipsoidal shape; computational cost is high; may not be suitable for high-dimensional datasets.

Angular Distance (COS). Function: $d(x_i, x_j) = 1 - \frac{\sum_{k=1}^{n} x_{ik}\, x_{jk}}{\left( \sum_{k=1}^{n} x_{ik}^{2} \cdot \sum_{k=1}^{n} x_{jk}^{2} \right)^{1/2}}$. Discussion: Calculates the relative distance between the objects from the origin; suitable for semi-structured datasets (i.e. widely applied in text document cluster analysis where the data is highly dimensional); does not depend on the vector length; scale invariant; the absolute distance between the data objects is not captured.

Pearson Correlation Distance (COR). Function: $d_{ij} = 1 - \frac{\sum_{k=1}^{n} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)}{\left( \sum_{k=1}^{n} (x_{ik} - \bar{x}_i)^{2} \cdot \sum_{k=1}^{n} (x_{jk} - \bar{x}_j)^{2} \right)^{1/2}}$. Discussion: Scale invariant; considers the correlation between the variables; calculates the relative distance between the objects from the mean of the data; suitable for semi-structured data analysis (i.e. applied in microarray analysis and document cluster analysis); outliers may affect the results.
The second group contains only the Canberra distance (CAN), which also calculates the absolute
distance; however, this measure is not scale dependent.
In the third group we have three distance
measures, namely the Angular or Cosine (COS), Pearson
Correlation (COR), and Mahalanobis (MAH) distances.
These measures take the correlation between
the variables into account and are scale invariant.
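To illustrate how the measures in Table 1 may be computed, the short Python sketch below evaluates each of them for one pair of vectors. The toy vectors, the small sample used to estimate the covariance matrix for the Mahalanobis distance, and the choice of λ = 3 for the Minkowski distance are assumptions for the example only; our experiments were run in MATLAB.

```python
import numpy as np
from scipy.spatial import distance

# Two illustrative objects.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 5.0])
# A small sample used only to estimate the covariance matrix needed by MAH.
sample = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 5.5],
                   [0.5, 1.5, 2.0], [3.0, 3.0, 4.0]])
cov_inv = np.linalg.inv(np.cov(sample, rowvar=False))

measures = {
    "EUC":  distance.euclidean(x, y),
    "MAN":  distance.cityblock(x, y),
    "MIN":  distance.minkowski(x, y, p=3),      # lambda = 3
    "CHEB": distance.chebyshev(x, y),
    "CAN":  distance.canberra(x, y),
    "COS":  distance.cosine(x, y),               # 1 - cosine similarity
    "COR":  distance.correlation(x, y),          # 1 - Pearson correlation
    "MAH":  distance.mahalanobis(x, y, cov_inv),
}
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```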
3.2 Proximity Measures for mixed
Variables
In the previous section, we concentrated our discus-
sion on datasets with numeric values. Nevertheless,
in practical applications, it is often possible to have
more than one type of attribute in the same dataset. It
follows that, in such cases, the conventional proximity
measures for those data types may not work well. A
more practical approach is to process all the variables
of different types together and then perform a sin-
gle cluster analysis (Kaufman and Rousseeuw, 2005).
Gower’s General Coefficient and Laflin’s General
Coefficient are two such functions that incorporate in-
formation from various data types into a single simi-
larity coefficient. Table 2 provides the equations and
additional information about these two coefficients.
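As an illustration of how the two coefficients may be computed, the sketch below handles a hypothetical mixed object pair; missing values and asymmetric binary attributes are omitted, and the per-type similarity scores passed to the Laflin function are assumed to be pre-computed with existing measures, as described in Table 2.

```python
import numpy as np

def gower_similarity(xi, xj, kinds, ranges):
    """Gower's coefficient expressed as a similarity for two mixed-type objects.
    kinds marks each attribute as 'num' or 'nom'; ranges holds (max - min) for
    each numeric attribute. Missing values and asymmetric binary attributes
    are not handled in this sketch."""
    dissims = []
    for a, kind in enumerate(kinds):
        if kind == "num":
            dissims.append(abs(xi[a] - xj[a]) / ranges[a])    # range-normalized distance
        else:                                                 # nominal / symmetric binary
            dissims.append(0.0 if xi[a] == xj[a] else 1.0)
    return 1.0 - float(np.mean(dissims))

def laflin_similarity(sims_by_type, counts):
    """Laflin's coefficient: per-type similarity scores weighted by the number
    of attributes of each type."""
    counts = np.asarray(counts, dtype=float)
    return float(np.dot(counts, sims_by_type) / counts.sum())

# Hypothetical object pair: two numeric attributes and one nominal attribute.
xi, xj = [0.3, 12.0, "red"], [0.7, 15.0, "blue"]
print(gower_similarity(xi, xj, ["num", "num", "nom"], [1.0, 50.0, None]))
print(laflin_similarity([0.8, 0.0], [2, 1]))  # illustrative per-type scores (numeric, nominal)
```

In our tests, the numeric score for the LAFLIN coefficient was computed with the Euclidean distance; this is precisely where the two coefficients differ, as discussed in Section 4.4.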
4 EXPERIMENTS
This section discusses our experimental methodology
and the results obtained for each of the data types.
In order to compare the performance of the proximity
measures for a particular data type, we performed ten-
fold cross validation (Costa et al., 2002) and classes
to clusters evaluation on each of the datasets. In
this paper, we consider the external measures when
evaluating our results, as the external class labels for
each of the datasets used were available to us. More-
over, this allows us to perform a fair comparison
against the known true clusters for all the proximity
measures. We used two such measures, namely F-
measure ((Paccanaro et al., 2006), (Steinbach et al.,
2000)) and G-means (Kubat et al., 1998). In addi-
tion, we used the Friedman Test (Japkowicz and Shah,
2011) to test the statistical significance of our results.
The test is best suited for situations, like ours, where
multiple comparisons are performed against multiple
datasets. It returns the p-value that helps us to deter-
mine whether to accept or reject the null hypothesis.
For us, the null hypothesis was “the performance of
all the proximity measures is equivalent”. For exam-
ple, a p-value less than 0.05 signifies that there is less
than a 5% probability of observing such a result if the
null hypothesis were true ((Japkowicz and
Shah, 2011), (Boslaugh and Watters, 2008)).
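For reference, the sketch below shows one way the two external measures may be computed from true class labels and cluster assignments; the class-to-cluster matching used here (each class matched to its best cluster) and the multi-class generalization of G-means as a geometric mean of per-class recalls are assumptions of this illustration rather than a description of our exact implementation.

```python
import numpy as np

def f_measure(labels, clusters):
    """Clustering F-measure: for each true class take the best F score over all
    clusters, then average the best scores weighted by class size."""
    labels, clusters = np.asarray(labels), np.asarray(clusters)
    total, score = len(labels), 0.0
    for c in np.unique(labels):
        in_class = labels == c
        best = 0.0
        for k in np.unique(clusters):
            overlap = np.sum(in_class & (clusters == k))
            if overlap == 0:
                continue
            precision = overlap / np.sum(clusters == k)
            recall = overlap / in_class.sum()
            best = max(best, 2 * precision * recall / (precision + recall))
        score += in_class.sum() / total * best
    return score

def g_means(labels, clusters):
    """G-means sketch: geometric mean of per-class recalls, each class being
    matched to the cluster that captures most of its members."""
    labels, clusters = np.asarray(labels), np.asarray(clusters)
    recalls = []
    for c in np.unique(labels):
        in_class = labels == c
        best_overlap = max(np.sum(in_class & (clusters == k)) for k in np.unique(clusters))
        recalls.append(best_overlap / in_class.sum())
    return float(np.prod(recalls) ** (1.0 / len(recalls)))
```

The Friedman test itself is available, for example, as scipy.stats.friedmanchisquare, which takes one score vector per proximity measure, with one entry per dataset.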
Table 2: Proximity measures for mixed variables.

Gower’s General Coefficient (GOWER). Function: $d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$. Discussion: For this coefficient, $\delta_{ij}^{(f)} = 0$ if one of the values is missing or the variable is an asymmetric binary variable; otherwise $\delta_{ij}^{(f)} = 1$. For numeric variables the distance is calculated using the formula $d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}}$. For binary and nominal variables the dissimilarity is the number of unmatched pairs. This measure may be extended to incorporate other attribute types (e.g. ordinal, ratio-scaled) (Han and Kamber, 2006).

Laflin’s General Similarity Coefficient (LAFLIN). Function: $s(i, j) = \frac{N_1 s_1 + N_2 s_2 + \ldots + N_n s_n}{N_1 + N_2 + \ldots + N_n}$. Discussion: In this function, $N_1, \ldots, N_n$ represent the total number of attributes of each variable type, whereas $s_1, \ldots, s_n$ represent the total similarity calculated for each attribute type. The function uses existing similarity measures to calculate the similarity scores $s_1, \ldots, s_n$.
Table 3: Dataset Information.

Numeric Datasets
Dataset          No. of Tuples   No. of True Clusters   No. of Attributes
Body             507             2                      21
Iris             150             3                      4
Wine             178             3                      13
Glass            214             6                      8
Ecoli            336             5                      7
SPECT            267             2                      44

Mixed Datasets
Dataset          No. of Tuples   No. of True Clusters   No. of Attributes
Automobile       205             6                      25
CRX              690             2                      15
Dermatology      366             6                      33
Hepatitis        155             2                      19
Post-Operative   90              3                      8
Soybean          290             15                     35
4.1 Implementation and Settings
For all the experiments in this paper, the data prepro-
cessing (e.g. replacing missing values, standardiza-
tion) was performed using WEKA (Witten and Frank,
2005), an open-source Java-based machine learning
software, developed at the University of Waikato
in New Zealand. The spectral cluster analysis al-
gorithms are implemented in MATLAB. Since
these algorithms manipulate the similarity matrix of
a dataset, computation of eigenvalues and eigenvec-
tors of the similarity matrix may be inefficient for a
large matrix. However, MATLAB efficiently solves
the eigensystem of large matrices. The cluster evalu-
ation measures have been implemented in Java.
4.2 Datasets
Six datasets are used for each of the data types. All
the datasets vary in size and are based on real-world
problems representing various domains and areas. For
the numeric data type, several proximity measures are
scale-dependent. As such, we standardized (Han and
Kamber, 2006) the attribute values to ensure the ac-
curacy of our results. In addition, we used Equation 1
to convert the distance measure into a similarity mea-
sure in order to create the similarity or weight matrix
W (Shi and Malik, 2000). It is also important to note
that the selection of sigma is crucial to the suc-
cess of the spectral clustering algorithm and that the value of
sigma varies depending on the proximity measure and
the dataset used. We adopted the method proposed by
Shi and Malik in (Shi and Malik, 2000) to select the
sigma value. The sigma returned through this method
was used as a starting point. Each of our experiments
was performed on a range of values surrounding that
sigma value, and the results with the best scores are
included in this paper.
$$s(x, y) = \exp\left( -\frac{d(x, y)^2}{2 \sigma^2} \right) \qquad (1)$$
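A small sketch of this procedure is given below, with an illustrative grid of multiplicative factors around the starting sigma; the evaluate argument stands for a full clustering run followed by an external evaluation measure, and the factor grid is an assumption of the example.

```python
import numpy as np

def similarity_matrix(dist, sigma):
    """Equation 1: convert a pairwise distance matrix into a similarity matrix."""
    return np.exp(-(dist ** 2) / (2.0 * sigma ** 2))

def search_sigma(dist, sigma_start, evaluate, factors=(0.25, 0.5, 1.0, 2.0, 4.0)):
    """Try a range of sigma values around a starting point and keep the one whose
    clustering scores best under the supplied external evaluation function."""
    best_sigma, best_score = None, -np.inf
    for f in factors:
        sigma = f * sigma_start
        score = evaluate(similarity_matrix(dist, sigma))  # cluster, then score (e.g. F-measure)
        if score > best_score:
            best_sigma, best_score = sigma, score
    return best_sigma, best_score
```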
Five of our datasets are from UCI repository (Asun-
cion and Newman, 2007) and one of them (Body
dataset) is from an external source (Heinz et al.,
2003). In Table 3, we provide a summary of the nu-
meric datasets. The six datasets used for the mixed
variable type are also obtained from the UCI reposi-
tory. A summary of the mixed datasets is presented in
Table 3.
4.3 Experimental Results for Numeric
Datasets
In Table 4 and Table 5 we present the F-measure and
G-means scores obtained for the datasets with nu-
meric variables when the SM (NCut) algorithm is used.
Our results show that the COR distance and the COS
distance measure often scored higher than the rest of
the distance measures (Figure 1). These two distance
measures performed well in four out of six datasets.
The datasets are Body, Iris, Wine, and Glass. We also
notice that most of the time, these two coefficients
achieved similar values for both the evaluation mea-
sures. The overall average difference for these two
distance measures is 0.02, irrespective of the dataset
or the splitting method used. In contrast, the MAH
distance measure performed poorly in four out of six
datasets. The datasets for which this distance measure
scored the lowest are Body, Iris, Wine, and Glass. The
MAN distance performed well for the Ecoli dataset
and the CAN distance performed best for the SPECT
dataset. We also notice that the performances of EUC,
MIN, MAN, CAN and CHEB distances are very sim-
ilar, and that they often scored moderately, in com-
parison to the highest and the lowest scores. For
example, based on the scores obtained for the Body
dataset, the distance measures may be grouped into
three groups: 1) the COS and COR distance measures
in one group, where the scores fall in the
range [0.96, 0.97]; 2) the CAN, EUC, MIN, MAN
and CHEB distance measures in a second group,
where the range is [0.85, 0.90]; and 3) the MAH dis-
tance measure, which scores the lowest (0.68). As ob-
served from the results, the EUC distance measure,
which is often used in the spectral cluster analysis al-
gorithms, may not always be a suitable choice.
Figure 1: F-measure scores for the numeric dataset when
tested on the SM (NCut) algorithm.
We observe that, if the COS distance or the COR
distance measure is used instead of the EUC distance,
on average the performance improved by 7.42% (F-
measure) and 8.17% (G-means), respectively, for our
datasets. In Table 6 and Table 7, we provide the eval-
uation scores from the NJW (K-means) algorithm. In
this case also, the results showed almost the same
trend as the results from the SM (NCut) algorithm.
For both the evaluation measures, the results indicate
that the MAH distance measure often scored the low-
est scores over a range of datasets. Among the six
datasets, in five of the cases the MAH distance scored
the lowest scores. These datasets are: Body, Iris,
Wine, Glass, and Ecoli. Furthermore, in none of the
cases did the EUC distance measure score the highest.
For two datasets (i.e. Body and Iris), the COR
distance and the COS distance performed well, and
for the Wine and Glass datasets, the scores were very
close to the highest scores achieved. The MAN dis-
tance performed well for the Ecoli and Glass datasets.
The Friedman test, which was used to measure the
statistical significance of our results, gives p-values of
0.0332 (SM (NCut)) and 0.0097 (NJW (K-means)),
respectively. Since the p-values are less than 0.05,
this indicates the results are statistically significant.
Our results showed that the MAH distance often per-
formed poorly when compared to the rest of the dis-
tance measures according to the cluster evaluation
measures. We noticed that, when the MAH distance is
used, the spectral clustering algorithms produced im-
balanced clusters. Here the clusters are imbalanced
when one partition contains relatively fewer objects
than the other cluster. We also noticed that the objects
that are placed in the smaller cluster are the objects
that have the lowest degree. Recall from Section 2
that the degree is the total similarity value from one
object to the rest of the objects in a dataset. In spec-
tral clustering, the objects are considered as nodes in
the graph, and a partition separates objects where the
total within cluster similarity is high and the between
cluster similarity is very low. Therefore, when the
degree is low for an object, compared to the rest of
the objects, it indicates that the object is less simi-
lar than most of the objects in the dataset. Now, the
equation of the MAH distance defines an ellipsoid in
n-dimensional space ((Lee and Verri, 2002), (Abou-
Moustafa and Ferrie, 2007)). The distance considers
the variance (how spread out the values are from the
mean) of each attribute as well as the covariance (how
much two variables change together) of the attributes
in the datasets. It gives less weight to the dimensions
with high variance and more weight to the dimen-
sions with small variance. The covariance between
the attributes allows the ellipsoid to rotate its axes
and increase and decrease its size (Abou-Moustafa
and Ferrie, 2007). Therefore, the distance measure is
very sensitive to the extreme points (Filzmoser et al.,
2005).
Figure 2: A scenario depicting the Mahalanobis (MAH) dis-
tance between three points.
Figure 2 illustrates a scenario showing the MAH distance between the objects.
Table 4: F-measure scores for Numeric datasets. Algorithm: SM (NCut), Splitting points: Zero and Mean value respectively.
Dataset COS COR CAN EUC MIN MAN CHEB MAH
Body 0.97 0.96 0.90 0.87 0.87 0.87 0.85 0.68
0.97 0.96 0.90 0.84 0.87 0.87 0.85 0.68
Iris 0.97 0.96 0.94 0.90 0.88 0.86 0.88 0.79
0.97 0.96 0.94 0.89 0.88 0.86 0.89 0.78
Wine 0.91 0.95 0.84 0.81 0.84 0.90 0.87 0.79
0.88 0.90 0.85 0.89 0.85 0.89 0.89 0.64
Glass 0.62 0.61 0.61 0.61 0.60 0.61 0.59 0.60
0.64 0.63 0.64 0.61 0.59 0.61 0.57 0.57
Ecoli 0.81 0.79 0.83 0.82 0.83 0.85 0.82 0.81
0.81 0.80 0.80 0.86 0.85 0.86 0.85 0.82
SPECT 0.80 0.77 0.81 0.80 0.80 0.81 0.80 0.80
0.79 0.77 0.82 0.81 0.79 0.81 0.81 0.80
Table 5: G-means scores for Numeric datasets. Algorithm: SM (NCut), Splitting points: Zero and Mean value respectively.
Dataset COS COR CAN EUC MIN MAN CHEB MAH
Body 0.97 0.96 0.90 0.87 0.87 0.87 0.85 0.71
0.97 0.97 0.90 0.83 0.84 0.87 0.85 0.71
Iris 0.97 0.96 0.94 0.90 0.88 0.87 0.88 0.80
0.97 0.96 0.94 0.90 0.89 0.87 0.89 0.80
Wine 0.92 0.95 0.86 0.83 0.86 0.91 0.87 0.80
0.89 0.91 0.86 0.90 0.86 0.90 0.89 0.67
Glass 0.65 0.64 0.65 0.66 0.64 0.64 0.64 0.62
0.66 0.65 0.67 0.64 0.62 0.64 0.61 0.60
Ecoli 0.82 0.80 0.83 0.82 0.83 0.86 0.83 0.81
0.81 0.81 0.81 0.86 0.86 0.86 0.85 0.82
SPECT 0.81 0.80 0.82 0.82 0.81 0.81 0.80 0.82
0.81 0.80 0.83 0.82 0.82 0.82 0.81 0.81
Table 6: F-measure scores of the NJW (K-means) algorithm (tested on numeric dataset).
Dataset COS COR CAN EUC MIN MAN CHEB MAH
Body 0.88 0.88 0.83 0.79 0.79 0.80 0.78 0.64
Iris 0.81 0.82 0.79 0.78 0.80 0.78 0.79 0.69
Wine 0.94 0.96 0.97 0.85 0.97 0.86 0.83 0.50
Glass 0.54 0.54 0.53 0.55 0.56 0.58 0.54 0.50
Ecoli 0.65 0.72 0.63 0.74 0.73 0.75 0.65 0.48
SPECT 0.77 0.77 0.77 0.77 0.77 0.77 0.76 0.77
Table 7: G-means scores of the NJW (K-means) algorithm (tested on numeric dataset).
Dataset COS COR CAN EUC MIN MAN CHEB MAH
Body 0.88 0.88 0.83 0.79 0.79 0.80 0.78 0.67
Iris 0.81 0.83 0.79 0.79 0.80 0.79 0.79 0.69
Wine 0.94 0.96 0.97 0.86 0.97 0.87 0.84 0.57
Glass 0.56 0.55 0.56 0.58 0.58 0.62 0.56 0.55
Ecoli 0.69 0.74 0.68 0.75 0.75 0.77 0.68 0.52
SPECT 0.80 0.80 0.80 0.80 0.80 0.80 0.79 0.80
In this figure, the dis-
tance between object 1 and 2 will be less than the dis-
tance between object 1 and 3, according to the MAH
distance. This is because object 2 lies very close to
the main axes along with the other objects, whereas
object 3 lies further away from the main axes.
Therefore, in such situations, the MAH distance will
be large. For numeric data, the similarity will be very
low when the distance is very large. The function in
Equation 1, which is used to convert a distance value
into a similarity value, will give a value close to zero
when the distance is very large. Therefore, the de-
gree from this object to the remainder of the objects
becomes very low and the spectral methods separate
these objects from the rest. This is one of the possible
reasons for the MAH distance performing poorly. It
either discovers imbalanced clusters or places similar
objects wrongly into two different clusters. Conse-
quently, one possible way to improve the performance
of the MAH distance measure might be by changing
the value of σ to a larger value. In this way, we may
prevent the similarity from taking a very small value.
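The following toy sketch reproduces this effect under an assumed elongated point cloud: two points that are equidistant from the origin in the Euclidean sense receive very different Mahalanobis distances, and the off-axis point is therefore pushed toward a near-zero similarity by Equation 1.

```python
import numpy as np
from scipy.spatial import distance

# An elongated point cloud: high variance along x, low variance along y,
# mimicking the scenario of Figure 2.
rng = np.random.default_rng(0)
cloud = np.column_stack([rng.normal(0.0, 5.0, 200), rng.normal(0.0, 0.5, 200)])
cov_inv = np.linalg.inv(np.cov(cloud, rowvar=False))

p1 = np.array([0.0, 0.0])   # object 1
p2 = np.array([4.0, 0.0])   # object 2: along the main axis
p3 = np.array([0.0, 4.0])   # object 3: same Euclidean distance, but off-axis

print(distance.euclidean(p1, p2), distance.euclidean(p1, p3))   # both 4.0
print(distance.mahalanobis(p1, p2, cov_inv),                    # small
      distance.mahalanobis(p1, p3, cov_inv))                    # large
```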
Figure 3: Example of cluster assignments of the Body
dataset. The circles are used to point to the several indi-
vidual members that are placed differently.
Our results also indicate that the COR and COS
distances performed best for four out of six datasets.
Both of the distance measures calculate the relative
distance from a fixed point (mean or zero, respec-
tively). Therefore, two objects with a similar pat-
tern will be more similar even if their sizes are differ-
ent. The EUC, MAN, MIN, CAN and CHEB distance
measures, however, calculate the absolute distance
(i.e. straight line distance from one point to another).
For instance, the Body dataset partitions the objects
into two main clusters, one with larger body dimen-
sions and another cluster with smaller body dimen-
sions. According to the true cluster information, the
larger body dimensions denote the Male population
and the smaller body dimensions denote the Female
population. When compared to the true clusters, we
observed that several individuals, whose body dimen-
sions are comparatively lower than the average body
dimensions of the Male population, are placed with
the individuals from the Female population by the dis-
tance measures that calculate the absolute distance.
Conversely, Female individuals with larger body di-
mensions than the average body dimensions of the Fe-
male population are placed with the individuals from
the Male population. Therefore, these individuals, which fall
very close to the boundary of the two true clusters,
are placed differently by the distance measures that
calculate the absolute distance (i.e. EUC distance,
MAN distance, and MIN distance) than the distance
measures that consider the relative distance (i.e. COR
distance and COS distance). In such cases, the COS
distance and the COR distance correctly identify these
individuals. In Figure 3 we plot the first two attributes
of the Body dataset when the EUC distance is used
as the distance measure. The object marked with a
smaller circle is an example of a Male individual with
smaller body dimensions. When the EUC distance is
used as the distance measure in the spectral cluster-
ing algorithm, this object is placed with the Female
population. The objects marked with the larger circle
illustrate the reversed situation, where Female indi-
viduals with larger body size are placed with the in-
dividuals from the Male population. In both situations,
the COR and COS distances placed the objects within
their own groups. In Figure 4, we provide the clus-
ters from the Ecoli dataset when the MAN (Left) and
the COS (Middle) distances are used. The farthest
right figure (with the title Original) depicts the true
clusters. The objects, according to the true clusters,
overlap between the clusters in a number of situations
(e.g. the objects marked with circle 2 and 4, or the
objects marked with circle 3 and 5). This indicates
that there are several objects in the dataset that may
be very similar, but are placed in two different true
clusters. When the spectral clustering algorithms are
applied to this dataset, both the COS and MAN dis-
tances divide the true cluster marked with circle 1 (in
Figure 4) into two different clusters. However, the
clusters produced by the COS distance contain mem-
bers from true clusters 1 and 3, whereas the clusters
produced by the MAN distance contain the members
from true cluster 1. The figure indicates that the shape
of the clusters produced by the COS distance is more
elongated toward the origin, which is the reason why
some of the members from true cluster 3 are included.
4.3.1 Discussion
The main conclusion drawn from our results thus in-
dicates that the MAH distance needs special considera-
tion. This measure tends to create imbalanced clusters
and therefore results in poor performance. In addi-
tion, the distance measures based on the relative dis-
tances (i.e. COR and COS distance measure) outper-
formed the distance measures based on the absolute
distance. We noticed that, in such cases, the objects
that reside in the boundary area are correctly identi-
fied by the relative distance measures. These bound-
ary objects are slightly different from the other mem-
bers of their own group and may need special atten-
tion. This is due to the fact that the COR and COS
measures consider the underlying patterns in between
the objects from a fixed point (i.e. mean or zero), in
contrast to the absolute distance approaches. There-
fore, the Euclidean (EUC) distance, which is a com-
monly used absolute distance measure in clustering
domains, may not always be a good selection for the
spectral clustering algorithm, especially in domains
where we are sensitive to outliers and anomalies.
Figure 4: Example of cluster assignments of the Ecoli dataset. (Left) The clusters obtained by using the MAN distance
measure, (Middle) the clusters obtained by using the COS distance, and (Right) the original true clusters.
Figure 5: Average F-measure scores for the Mixed Datasets
when tested on the SM(NCut) algorithm.
4.4 Experimental Results for mixed
Datasets
In Table 8, we present the F-measure and G-means
scores for the Mixed datasets, when tests are applied
on the SM (NCut) algorithm. The results from NJW
(K-means) algorithm are given in Table 9. Figure 5
presents a graphical representation of our results. Our
results from the external evaluation scores show that
the GOWER coefficient performed well for the Auto-
mobile and Dermatology dataset. The LAFLIN’s co-
efficient also performed well for two of the datasets.
The datasets are CRX and Hepatitis. For the Post Opera-
tive and Soybean datasets, both coefficients achieved
the same scores. In contrast, when the tests are ap-
plied on the NJW (K-means) algorithm, our results in-
dicate that the GOWER coefficient performed slightly
better than the LAFLIN’s. In four out of six datasets,
the GOWER coefficient scored slightly higher than the
LAFLIN’s coefficient. The datasets are Automobile,
Dermatology, Post Operative, and Soybean. For this
case also, the LAFLIN’s coefficient performed best
for the same two datasets (i.e. CRX and Hepatitis) as
our previous test on the SM (NCut) algorithm. How-
ever, we also noticed from the scores that the differ-
ence between the performances of the two coefficients
is very low. The p-values from the Friedman test are
0.3173 and 0.4142, respectively. Since both the val-
ues are greater than 0.05, the difference between the
performance of the two coefficients is not statistically
significant.
In this part, we analyze the coefficients to deter-
mine the relationship between them. The equations
for the two coefficients are given in Table 2. For the
GOWER coefficient, the term $\delta_{ij}^{(f)}$ is an indicator vari-
able associated with each of the variables present in
the dataset, and the term $d_{ij}^{(f)}$ is the distance or dissim-
ilarity calculated for each variable for objects i and j.
We also know from the description given in Table 2
that $\delta_{ij}^{(f)} = 0$ for the asymmetric binary variables and
$\delta_{ij}^{(f)} = 1$ for all the other types. In our datasets, all
of the attributes are numeric, nominal, or symmetric
binary. Therefore, the denominator of the GOWER’s
equation represents the total number of variables in
the dataset. Also, recall that the GOWER coefficient
is a dissimilarity measure, where the dissimilarity be-
tween the two objects, i and j, falls in between 0 and
1. This equation is converted into a similarity mea-
sure by subtracting from 1. Therefore, the equation
for the GOWER similarity coefficient is:
$$s(i, j) = 1 - \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}} \qquad (2)$$
Let $N = \sum_{f=1}^{p} \delta_{ij}^{(f)}$ be the total number of attributes and
let $\delta_{ij}^{(f)} = 1$ for each attribute; then Equation 2 becomes
$$s(i, j) = 1 - \frac{\sum_{f=1}^{p} d_{ij}^{(f)}}{N} = \frac{N - \sum_{f=1}^{p} d_{ij}^{(f)}}{N} \qquad (3)$$
Notice from the equation of LAFLIN’s coefficient that $s_i$
is the total similarity value of attribute type i, and $N_i$
is the total number of variables of attribute type i. In
our datasets, the attribute types are numeric ($N_1$ and
$s_1$), nominal ($N_2$ and $s_2$), and symmetric binary ($N_3$
and $s_3$). Therefore, the equation becomes
$$s(i, j) = \frac{N_1 s_1 + N_2 s_2 + N_3 s_3}{N_1 + N_2 + N_3} \qquad (4)$$
Notice that in Equation 4, the denominator is the to-
tal number of attributes in a given dataset, which we
previously denoted as N. Therefore, Equation 4 is the
same as the following equation:
$$s(i, j) = \frac{N - (N_1 - N_1 s_1 + N_2 - N_2 s_2 + N_3 - N_3 s_3)}{N} \qquad (5)$$
Table 8: F-measure and G-means scores for Mixed datasets. Algorithm: SM (NCut), Splitting points: zero and mean value,
respectively.
F-measure G-means
Dataset GOWER LAFLIN Dataset GOWER LAFLIN
Automobile 0.46 0.44 Automobile 0.49 0.48
0.46 0.46 0.50 0.49
CRX 0.76 0.79 CRX 0.76 0.79
0.76 0.79 0.76 0.79
Dermatology 0.85 0.87 Dermatology 0.86 0.87
0.84 0.82 0.85 0.83
Hepatitis 0.71 0.73 Hepatitis 0.74 0.75
0.73 0.75 0.75 0.77
Post Operative 0.56 0.56 Post Operative 0.57 0.57
0.56 0.56 0.58 0.57
Soybean 0.70 0.70 Soybean 0.72 0.72
0.70 0.70 0.72 0.72
Table 9: F-measure and G-means scores from the NJW (K-means) algorithm (tested on mixed dataset).
F-measure G-means
Dataset GOWER LAFLIN GOWER LAFLIN
Automobile 0.47 0.45 0.48 0.46
CRX 0.76 0.80 0.76 0.80
Dermatology 0.84 0.82 0.86 0.84
Hepatitis 0.71 0.74 0.74 0.76
Post Operative 0.52 0.47 0.53 0.49
Soybean 0.57 0.53 0.61 0.56
Figure 6: Comparison of numeric functions on Iris dataset. (From left) the clusters obtained from the true clusters, the clusters
obtained from the numeric function of the GOWER coefficient, and the clusters obtained from the numeric function of the
LAFLIN coefficient.
Equation 5 can be re-written as:
$$s(i, j) = \frac{N - \left( N_1 (1 - s_1) + N_2 (1 - s_2) + N_3 (1 - s_3) \right)}{N} \qquad (6)$$
At this point, the GOWER equation given in Equation
3 and the LAFLIN coefficient given in Equation 6
both have similar patterns. They have the same de-
nominator and differ only in the terms in the
numerator. As mentioned previously, $d_{ij}^{(f)}$ is the dis-
tance or dissimilarity between the two objects i and
j, whereas $(1 - s_1)$ is also a dissimilarity measure.
Both functions handle nominal and binary variables
in the same way. Therefore, this implies that the dif-
ference in the equations occurs due to the functions
selected for the numeric attributes which are handled
differently by the two coefficients. This is one of the
reasons that the difference between the performances
of both of the coefficients is very low. In our tests,
we used the Euclidean distance for the LAFLIN’s co-
efficient, whereas the GOWER coefficient uses the
distance measure given in Equation 7.
$$d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}} \qquad (7)$$
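A toy comparison of the two numeric treatments, with hypothetical attribute values and ranges, illustrates the difference: Equation 7 maps every attribute difference onto the same [0, 1] scale, whereas a raw Euclidean distance is dominated by the attribute with the largest range (in our experiments the attributes were standardized first, which mitigates but does not remove the difference in behavior).

```python
import numpy as np

def gower_numeric_dissim(xi, xj, col_min, col_max):
    """Equation 7: per-attribute, range-normalized dissimilarities in [0, 1]."""
    return np.abs(xi - xj) / (col_max - col_min)

# Two hypothetical objects described by attributes with very different ranges.
xi = np.array([0.2, 10.0])
xj = np.array([0.8, 20.0])
col_min = np.array([0.0, 0.0])
col_max = np.array([1.0, 100.0])

per_attr = gower_numeric_dissim(xi, xj, col_min, col_max)
print(per_attr, per_attr.mean())    # GOWER contribution: both attributes weighed alike
print(np.linalg.norm(xi - xj))      # unscaled Euclidean distance: dominated by attribute 2
```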
We use the two numeric functions with the spec-
tral clustering algorithms and apply them on the Iris
dataset from the UCI repository (Asuncion and Newman,
2007) to evaluate their performances. Figure 6 il-
lustrates the clusters obtained from the true clusters
(left), the clusters obtained from the numeric function
of the GOWER coefficient (middle), and the clusters
obtained from the numeric function of the LAFLIN
coefficient (right). We notice that both of the mea-
sures correctly cluster the objects from true cluster
1. However, the difference between them is clear in
true cluster 2 and cluster 3. Notice that these two true
clusters have objects that overlap near the boundary
of the clusters. The objects located at the boundary
usually have attribute values slightly different from
the other members of their own true clusters. The nu-
meric function for the GOWER coefficient correctly
distinguishes several objects near the boundary. How-
ever, the LAFLIN coefficient, which used the Eu-
clidean distance to compute the distance between the
objects, placed the objects located near the boundary
in two different clusters. We notice that the
clusters formed from this measure have a shape simi-
lar to a sphere. This may be the reason for this mea-
sure performing slightly differently than the function
of the GOWER coefficient.
4.4.1 Discussion
In summary, under certain conditions, the GOWER
similarity coefficient and the LAFLIN coefficient per-
form similarly. The constraints are as follows: 1) the
dataset does not include asymmetric binary variables,
and 2) the distance and similarity measures for each
of the variables are the same. Recall from Section
4.3 that our results for numeric variables indicate that
the Euclidean distance may not be the best choice for
numeric variables. This choice seems to impact the
performance of the LAFLIN coefficient, which may
be improved by using a different distance measure for
the numeric variables.
5 CONCLUSIONS
In cluster analysis, the selection of proximity mea-
sures is a crucial step that has a huge impact on the
quality and usability of the end results. However,
this fact is frequently overlooked, leading to a degra-
dation of the potential knowledge being discovered. To
address this issue, this paper presents an explorative
and comparative study of the performance of various
proximity measures when applied to the spectral clus-
tering algorithms. In particular, our study addresses the
question of when, and where, the choice of proximity
measure becomes crucial in order to succeed. Our
results indicate that proximity measures need spe-
cial care in domains where the data is highly imbal-
anced and where it is important to correctly cluster
the boundary objects. These cases are of special inter-
est in application areas such as rare disease diagnosis,
oil spill detection and fraud detection.
Our future work will consider a diverse selection
of datasets. We aim to evaluate if our conclusions
hold for sparse datasets with noise and many missing
values. We will also extend our research to very large
datasets with high dimensionality. For such datasets,
these proximity measures may not perform as per our
expectation. That is, with high dimensions, the data
may become sparse and the distance computed from
these measures may not capture similarities properly.
In such cases, a different set of proximity measures
may be required to deal with the problem of high di-
mensionality.
The selection of the most suitable proximity mea-
sures when specifically aiming to detect outliers and
anomalies is another topic of future research. In order
to reach a conclusion with higher generality, we are
interested to see whether the conclusions drawn from
our paper persist for other clustering algorithms. The
development of additional measures for mixed data
types, especially ones that do not use the Euclidean
distance for numeric data, is also a significant issue
that will benefit from further research.
REFERENCES
Abou-Moustafa, K. T. and Ferrie, F. P. (2007). The mini-
mum volume ellipsoid metric. In Proceedings of the
29th DAGM conference on Pattern recognition, pages
335–344, Berlin, Heidelberg. Springer-Verlag.
Aiello, M., Andreozzi, F., Catanzariti, E., Isgro, F., and San-
toro, M. (2007). Fast convergence for spectral cluster-
ing. In ICIAP ’07: Proceedings of the 14th Interna-
tional Conference on Image Analysis and Processing,
pages 641 – 646, Washington, DC, USA. IEEE Com-
puter Society.
Asuncion, A. and Newman, D. (2007). UCI Machine Learn-
ing Repository.
Bach, F. R. and Jordan, M. I. (2003). Learning spectral clus-
tering. In Advances in Neural Information Process-
ing Systems 16: Proceedings of the 2003 conference,
pages 305–312. Citeseer.
Bach, F. R. and Jordan, M. I. (2006). Learning spectral
clustering, with application to speech separation. J.
Mach. Learn. Res., 7:1963–2001.
Boslaugh, S. and Watters, P. A. (2008). Statistics in a nut-
shell. O’Reilly & Associates, Inc., Sebastopol, CA,
USA.
Costa, I. G., de Carvalho, F. A. T., and de Souto, M. C. P.
(2002). Comparative study on proximity indices for
cluster analysis of gene expression time series. Jour-
nal of Intelligent and Fuzzy Systems: Applications in
Engineering and Technology, 13(2-4):133 – 142.
Everitt, B. S. (1980). Cluster Analysis. Edward Arnold and
Halsted Press, 2nd edition.
Filzmoser, P., Garrett, R., and Reimann, C. (2005). Multi-
variate outlier detection in exploration geochemistry.
Computers and Geosciences, 31(5):579–587.
Fischer, I. and Poland, J. (2004). New methods for spectral
clustering. Technical Report IDSIA-12-04, IDSIA.
Han, J. and Kamber, M. (2006). Data Mining: Concepts
and Techniques, 2nd Ed. Morgan Kaufmann Publish-
ers Inc., San Francisco, CA, USA.
Heinz, G., Peterson, L. J., Johnson, R. W., and Kerk, C. J.
(2003). Exploring relationships in body dimensions.
Journal of Statistics Education, 11(2).
Jain, A. K., Murty, M. N., and Flynn, P. J. (1999). Data
clustering: a review. ACM Computing Surveys,
31(3):264–323.
Japkowicz, N. and Shah, M. (2011). Performance Evalua-
tion for Classification: A Machine Learning and Data
Mining Perspective (in progress), Chapter 6: Statisti-
cal Significance Testing.
Kaufman, L. and Rousseeuw, P. (2005). Finding Groups
in Data: An Introduction to Cluster Analysis. Wiley-
Interscience.
Kubat, M., Holte, R. C., and Matwin, S. (1998). Machine
learning for the detection of oil spills in satellite radar
images. Machine Learning, 30(2 - 3):195 – 215.
Larose, D. T. (2004). Discovering Knowledge in Data: An
Introduction to Data Mining. Wiley-Interscience.
Lee, S. and Verri, A. (2002). Pattern Recognition With Sup-
port Vector Machines: First International Workshop,
Svm 2002, Niagara Falls, Canada, August 10, 2002:
Proceedings. Springer.
Luxburg, U. (2007). A tutorial on spectral clustering. Statis-
tics and Computing, 17(4):395–416.
Meila, M. and Shi, J. (2001). A random walks view of spec-
tral segmentation. In International Conference on Ar-
tificial Intelligence and Statistics (AISTAT), pages 8–
11.
Ng, A. Y., Jordan, M. I., and Weiss, Y. (2001). On spectral
clustering: Analysis and an algorithm. In Dietterich,
T. G., Becker, S., and Ghahramani, Z., editors, Advances in
Neural Information Processing Systems, volume 14,
pages 849–856.
Paccanaro, A., Casbon, J. A., and Saqi, M. A. (2006). Spec-
tral clustering of protein sequences. Nucleic Acids
Res, 34(5):1571–1580.
Shi, J. and Malik, J. (2000). Normalized cuts and image
segmentation. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 22(8):888–905.
Steinbach, M., Karypis, G., and Kumar, V. (2000). A
comparison of document clustering techniques. KDD
Workshop on Text Mining.
Teknomo, K. (2007). Similarity Measurement. Website.
Verma, D. and Meila, M. (2001). A comparison of spectral
clustering algorithms.
Webb, A. R. (2002). Statistical Pattern Recognition, 2nd
Edition. John Wiley & Sons.
Witten, I. H. and Frank, E. (2005). Data Mining: Practi-
cal Machine Learning Tools and Techniques. Morgan
Kaufmann, 2 edition.