Data Clustering Method based on Mixed Similarity Measures
Doaa S. Ali, Ayman Ghoneim and Mohamed Saleh
Department of Operations Research and Decision Support,
Faculty of Computers and Information, Cairo University
5 Dr. Ahmed Zewail Street, Orman, 12613 Giza, Egypt
Keywords: Mixed Datasets, Similarity Measures, Data Clustering Algorithms, Differential Evolution.
Abstract: Data clustering aims to organize data and concisely summarize it according to cluster prototypes. There are
different types of data (e.g., ordinal, nominal, binary, continuous), and each type has an appropriate similarity
measure. However, when dealing with a mixed dataset (i.e., a dataset that contains at least two types of data),
existing clustering methods use a unified similarity measure. In this study, we propose a novel clustering method for
mixed datasets. The proposed mixed similarity measure (MSM) method uses a specific similarity measure
for each type of data attribute. When computing distances and updating cluster centers, the MSM method
combines the advantages of the K-modes and K-means algorithms. The proposed MSM method is tested
on benchmark real-life datasets obtained from the UCI Machine Learning Repository. The MSM method's
performance is compared against other similarity methods in both a non-evolutionary clustering setting
and an evolutionary clustering setting (using differential evolution). Based on the experimental results, the
MSM method proved its efficiency in dealing with mixed datasets, achieving a significant improvement in
clustering performance on 80% of the tested datasets in the non-evolutionary clustering setting and on
90% of the tested datasets in the evolutionary clustering setting. The time and space complexity of our
proposed method is analyzed, and the comparison with the other methods demonstrates the effectiveness of
our method.
1 INTRODUCTION
Unsupervised clustering aims to extract the natural
partitions in a dataset without a priori class
information. It groups the dataset observations into
clusters such that observations within a cluster are more
similar to each other than to observations in other
clusters (Bhagat et al., 2013; Tiwari and Jha, 2012).
The K-means clustering algorithm is widely and efficiently used
for numerical datasets, where means
serve as the centers (centroids) of the data clusters. In the
K-means algorithm, observations are partitioned into
K clusters, and each observation belongs to the
cluster with the closest mean (i.e., centroid)
(Serapião et al., 2016). When dealing with
categorical data (Bai et al., 2013; Kim, 2008), the K-modes
(Ammar and Lingras, 2012) and K-medoids
(Mukhopadhyay and Maulik, 2007) clustering
algorithms are used instead of K-means. In the K-modes
algorithm, modes replace means as the cluster
representatives, a simple matching dissimilarity measure
is used, and a frequency-based method updates the modes
during the clustering process. The K-medoids algorithm,
on the other hand, computes a cluster medoid instead of
the cluster mean. A medoid is a representative
observation in a cluster whose sum of distances
to the other observations in the cluster is minimal
(Mukhopadhyay and Maulik, 2007).
There are four main types of data attributes:
nominal, ordinal, binary, and numerical.
Ordinal and nominal attributes describe
categorical data. Nominal attributes label variables
without any quantitative value; their values are
mutually exclusive (no overlap) and carry no numerical
significance, e.g., names, gender, and colors.
Ordinal attributes have ordered values that
capture importance or significance, but the
differences between values are not quantified, e.g.,
(excellent, very good, good, bad) and (very happy, happy,
unhappy). Numerical attributes can be
either discrete or continuous (e.g., temperature,
height, and weight). Distance or similarity measures
are used to solve many pattern recognition problems
such as classification, clustering, and retrieval
(Cha, 2007). A distance is mathematically
defined as a quantitative measure of how far apart two
data points are. The choice of a distance/similarity
measure depends on the types of data attributes in
the processed dataset.
Most traditional clustering models are
built to deal with either numerical or categorical
data. In the real world, however, collected data
often have both numeric and categorical attributes
(i.e., a mixed dataset), and it is hard to apply
traditional clustering algorithms directly to such
mixed datasets. When it comes to dealing with
mixed datasets, previous work adopted two
approaches. The first approach unifies the
similarity measure when dealing with mixed datasets
(e.g., Parameswari et al., 2015; Shih et al., 2010;
Soundaryadevi and Jayashree, 2014): it converts the
mixed dataset to either pure numerical data or
pure categorical data in a pre-processing step
before applying the clustering algorithm.
Unfortunately, this approach is not always practical, because
there are data instances for which the conversion does
not yield meaningful numerical data. Furthermore,
the conversion may lead to loss of information. The
second approach divides the original dataset into
pure numerical and pure categorical subsets (e.g., Asadi et
al., 2012; Ahmad and Dey, 2007; Shih et al., 2010;
Mutazinda et al., 2015; Pinisetty et al., 2012).
Appropriate clustering algorithms are used to
produce the corresponding clusters for these pure
subsets. The clustering results on the categorical
and numerical subsets are then combined as a
categorical dataset on which a categorical data
clustering algorithm is employed to obtain the final
output. This approach suffers from excessive
implementation complexity, especially when
dealing with large datasets.
Recently, researchers have given much attention
to distance metric learning for semi-supervised
clustering algorithms (e.g., Relevant Component
Analysis and Discriminative Component Analysis) when
handling mixed or complicated datasets (Kumar and
Kummamuru, 2008; Baghshah and Shouraki, 2009).
Semi-supervised clustering algorithms
partition a given dataset using additional supervisory
information (Kumar and Kummamuru, 2008). The most
popular form of supervision in this category of
clustering algorithms is pairwise
constraints. Learning a distance metric is
equivalent to finding a rescaling of a given dataset
and then applying the standard Euclidean metric (Xing,
2003). Distance metric learning is mainly designed
for semi-supervised clustering algorithms and also
suffers from excessive implementation complexity.
To overcome these limitations, we
introduce a novel clustering method for mixed
datasets. The proposed mixed similarity measure
(MSM) method uses the appropriate similarity
measure for each type of data attribute. It combines
the capabilities of the K-modes and K-means
algorithms when computing distances and updating
the cluster centers. The proposed MSM method
is tested on six benchmark real-life datasets
obtained from the UCI Machine Learning
Repository (Blake and Merz, 1998), and it achieves
a significant improvement in clustering
performance in both a non-evolutionary clustering setting
and an evolutionary clustering setting. The time
and space complexity of our proposed method is
analyzed, and the comparison with the other
methods demonstrates the effectiveness of our method.
The rest of the paper is organized as follows.
Section 2 introduces related work and provides
background on the K-means and K-modes algorithms and on
differential evolution. Section 3 presents the
proposed MSM method. Section 4 describes the
experimental design, including the differential
evolution setting. Section 5 reports the
experimental results and analyses. Section 6
concludes the work and discusses future directions.
2 BACKGROUND
In this section, we cover the preliminary concepts
needed in our work, namely the clustering problem,
the K-means and K-modes clustering algorithms, and
the differential evolution algorithm.
2.1 Clustering Problem
Formally, a clustering problem is represented as an
optimization problem as follows:

$\min \sum_{i=1}^{n} \sum_{j=1}^{k} \mu_{ij}\, d(x_i, z_j)$, subject to $\sum_{j=1}^{k} \mu_{ij} = 1$ and $\mu_{ij} \in \{0, 1\}$   (1)

where $n$ is the number of data points, $k$ is the
number of data clusters, and $\mu_{ij}$ is the membership of
the $i$-th data observation in cluster $j$ ($\mu_{ij}$ takes binary
values in the crisp case). $d(x_i, z_j)$ is the
distance measure between data point $x_i$ and data
cluster center $z_j$.
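As an illustration, the following minimal Python sketch (function names and toy data are our own choices, not part of the paper) evaluates the crisp objective of Equation 1, where each point contributes exactly one distance term to the sum:

```python
import numpy as np

def clustering_cost(X, Z, labels, dist):
    # Eq. 1 in the crisp case: mu_ij = 1 iff labels[i] == j,
    # so the double sum collapses to one distance term per point.
    return sum(dist(X[i], Z[labels[i]]) for i in range(len(X)))

# Toy numerical data with a squared Euclidean distance.
X = np.array([[1.0, 2.0], [1.2, 1.8], [8.0, 9.0]])
Z = np.array([[1.1, 1.9], [8.0, 9.0]])          # two cluster centers
labels = [0, 0, 1]                               # crisp memberships
sq_euclidean = lambda x, z: float(np.sum((x - z) ** 2))
print(clustering_cost(X, Z, labels, sq_euclidean))  # ~0.04
```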
2.2 K-Means Clustering Algorithm
The K-means algorithm is a widely used clustering
algorithm for numerical data sets because of its
simplicity (Bai et al., 2013). The K-means algorithm
searches for a nearly optimal partition with a fixed
number of clusters. It aims to minimize the
total distance between data points and their cluster centers (Wu
et al., 2008), where

$d(x_i, z_j) = \lVert x_i - z_j \rVert^2$   (2)

is the distance measure between data point $x_i$ and
data cluster center $z_j$. The steps of the K-means
clustering algorithm are as follows (Kim and
Hyunchul, 2008):
1: Randomly initialize centers for the k clusters.
2: Assign each data point to the cluster with the nearest center (Eq. 2).
3: Update the center of each cluster.
4: Repeat steps 2 and 3 until the cluster centers stop changing or other stopping criteria are met.
Procedure 1: Steps of the K-means algorithm.
In step 3, the j-th cluster center is updated by
taking the mean of the data observations assigned to
cluster j in step 2.
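A compact sketch of one assignment-and-update pass (steps 2 and 3 of Procedure 1) might look as follows in Python with NumPy; the function name and array shapes are illustrative assumptions:

```python
import numpy as np

def kmeans_step(X, Z):
    # Step 2 of Procedure 1: assign each point to the nearest center.
    d = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)   # (n, k) squared distances
    labels = d.argmin(axis=1)
    # Step 3: recompute each center as the mean of its assigned points.
    Z_new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else Z[j]
                      for j in range(len(Z))])                # keep empty clusters as-is
    return labels, Z_new
```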
2.3 K-Modes Clustering Algorithm
The K-modes clustering algorithm extends the K-means
algorithm to cluster categorical data (Gibson et al.,
1998) by replacing the means of clusters with modes. The
K-modes algorithm uses a simple matching distance
(Aranganayagi and Thangavel, 2009), also known as the
Hamming distance, when measuring distances between data
observations. To define the matching distance
measure, let x and y be two data observations in a
dataset D and L be the number of attributes of a data
observation. The simple matching distance measure
between x and y is defined as:

$d(x, y) = \sum_{l=1}^{L} \delta(x_l, y_l)$   (3)

where

$\delta(x_l, y_l) = \begin{cases} 0 & \text{if } x_l = y_l \\ 1 & \text{if } x_l \neq y_l \end{cases}$
The steps of the K-modes clustering algorithm are
similar to those of the K-means algorithm (Procedure 1),
except that the center of each cluster is updated according
to the following equation:

$z_{jl} = \arg\max_{a \in \mathrm{DOM}(A_l)} f(a \mid j)$   (4)

where $z_{jl}$ represents the new updated value of the center of cluster
j in the l-th attribute, and the maximizer $a_r$ is the most frequent value
of the l-th attribute among the data observations within
cluster j. $\mathrm{DOM}(A_l)$ denotes the domain of attribute $A_l$,
i.e., all the possible values which can be taken by the attribute,
$f(a \mid j)$ is the number of data observations in cluster j whose
l-th attribute takes the value a, and $n_j$ is the
total number of data observations in cluster j.
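For concreteness, the following Python sketch implements the matching distance of Equation 3 and the frequency-based mode update of Equation 4; the function names and example values are illustrative, not the paper's code:

```python
from collections import Counter

def matching_distance(x, y):
    # Eq. 3: count the attributes on which the two observations differ.
    return sum(1 for a, b in zip(x, y) if a != b)

def update_mode(cluster_members):
    # Eq. 4: for each attribute (column), take the most frequent value
    # among the observations currently assigned to the cluster.
    return [Counter(column).most_common(1)[0][0]
            for column in zip(*cluster_members)]

# Example: three categorical observations in one cluster.
members = [("red", "s"), ("red", "m"), ("blue", "m")]
print(update_mode(members))                             # ['red', 'm']
print(matching_distance(("red", "s"), ("blue", "m")))   # 2
```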
2.4 Differential Evolution
Differential evolution (DE) is a population-based
global optimization algorithm that uses a real-coded
representation (Saha et al., 2010). DE belongs to the
class of evolutionary algorithms, since it uses selection,
crossover, and mutation operators to optimize an
objective function over the course of successive
generations (Suresh et al., 2009). The DE operators
are as follows:
1. Mutation operator: In generation t, let $x_{i,t}$ be the
i-th solution vector in a population of size NP
(i.e., $i = 1, 2, \ldots, NP$). For each solution vector
$x_{i,t}$, a mutant vector $v_{i,t}$ is generated using three
randomly picked solutions from the population
according to the following equation:

$v_{i,t} = x_{r_1,t} + F \cdot (x_{r_2,t} - x_{r_3,t})$   (5)

where $r_1, r_2, r_3 \in \{1, 2, \ldots, NP\}$ are three mutually
distinct random indices with $r_1, r_2, r_3 \neq i$, and
$F \in [0, 2]$ is a real number representing the
differential weight.
2. Crossover operator: Let L be the dimension of a
solution vector and $j = 1, 2, \ldots, L$ be the index over
dimensions. The mutant vector $v_{i,t}$ and the
target solution vector $x_{i,t}$ are crossed to generate a
trial solution vector

$u_{i,t} = (u_{i1,t}, u_{i2,t}, \ldots, u_{iL,t})$   (6)

where

$u_{ij,t} = \begin{cases} v_{ij,t} & \text{if } rand_j \leq CR \text{ or } j = j_{rand} \\ x_{ij,t} & \text{otherwise} \end{cases}$

where $rand_j \in [0, 1]$ is a uniformly generated random
number, $CR \in [0, 1]$ is the crossover probability,
and $j_{rand} \in \{1, 2, \ldots, L\}$ is a randomly chosen
dimension index.
3. Selection operator: The trial vector $u_{i,t}$ is
compared against $x_{i,t}$ and will replace it in the
population if the following condition is met, where
$f(\cdot)$ is the fitness function:

$x_{i,t+1} = \begin{cases} u_{i,t} & \text{if } f(u_{i,t}) \leq f(x_{i,t}) \\ x_{i,t} & \text{otherwise} \end{cases}$   (7)
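The three operators can be sketched together in Python as follows; this is an illustrative minimization-oriented implementation under our own naming, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def de_step(pop, fitness, F=0.5, CR=0.2):
    # One DE generation: mutation (Eq. 5), binomial crossover (Eq. 6),
    # and greedy selection (Eq. 7) for a minimization problem.
    NP, L = pop.shape
    new_pop = pop.copy()
    for i in range(NP):
        # Eq. 5: mutant from three mutually distinct solutions, all != i.
        r1, r2, r3 = rng.choice([r for r in range(NP) if r != i], 3, replace=False)
        v = pop[r1] + F * (pop[r2] - pop[r3])
        # Eq. 6: binomial crossover; j_rand guarantees at least one mutant gene.
        j_rand = rng.integers(L)
        mask = rng.random(L) <= CR
        mask[j_rand] = True
        u = np.where(mask, v, pop[i])
        # Eq. 7: keep the trial vector only if it is at least as fit.
        if fitness(u) <= fitness(pop[i]):
            new_pop[i] = u
    return new_pop
```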
3 MSM METHOD
The proposed MSM method is a novel clustering
model based on using different similarity measures
when dealing with mixed datasets. The MSM
method has a pool of different similarity measures
and selects among them according to the type of the data
attribute under consideration. When computing distances and
updating centroids, the MSM method combines
the capabilities of the K-modes and K-means
algorithms. Thus, we modify some steps of the
traditional clustering model. Procedure 2 shows the
steps of the MSM method. The modified steps are
explained in detail in the next subsections.
1: Randomly assign each data element a cluster number between 1 and k, where k is the desired number of clusters.
2: Find the center of each cluster.
3: For each data element, find the cluster center that is closest to the element, and assign the element to that cluster.
4: Re-compute the cluster centers with the new assignment of elements.
5: Repeat steps 3 and 4 until the clusters do not change or for a fixed number of iterations.
Procedure 2: Steps of the MSM method.
3.1 Computing Distances
In the proposed MSM method, let A and B be two
mixed data points with m attributes. When
computing the distance between A and B, the MSM
method calls the similarity measure appropriate to each
attribute's type and computes a sub-distance between
the attribute in A and the same attribute in B. The
total distance between A and B is the sum of the
sub-distances over the m attributes. The similarity
measures used are normalized to the [0, 1]
interval as follows:
For an ordinal data attribute:

$z_{i,a} = \dfrac{r_{i,a} - 1}{M_a - 1}$   (8)

where $z_{i,a}$ is the standardized value of attribute $a$
of data object $i$, $r_{i,a}$ is its rank value before
standardization, and $M_a$ is the upper limit of the domain
of attribute $a$; the sub-distance between two data objects
is then the absolute difference of their standardized values.
For binary and nominal data attributes, we use the
matching distance (Equation 3).

For a numerical data attribute, we use the
following equation:

$z_{ij,a} = \dfrac{|x_{i,a} - x_{j,a}|}{\max x_a - \min x_a}$   (9)

where $z_{ij,a}$ is the standardized difference of
attribute $a$ between two data objects $i$ and
$j$, $x_{i,a}$ and $x_{j,a}$ are the values of attribute $a$ of
objects $i$ and $j$ before standardization, and $\max x_a$
and $\min x_a$ are the upper and lower limits of the
domain of attribute $a$, respectively.
Figure 1 shows an example of two mixed data
points A and B. The first two attributes are
binary and nominal, so the matching distance
is used to measure the sub-distances between
them. The third attribute is ordinal, so its sub-distance
is calculated using Equation 8, where
the domain of this attribute is from 1 to 4. The
last attribute is numerical with the range
[150, 175], so its sub-distance is calculated using
Equation 9. Finally, the total distance between
A and B is the sum of these sub-distances,
which is 1.73.
Figure 1: An example of calculating the distances in the
MSM method.
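Since Figure 1 is not reproduced here, the following Python sketch recomputes an MSM distance on hypothetical values with the same attribute types (binary, nominal, ordinal with domain 1 to 4, numerical with range [150, 175]); all data values and names are invented for illustration, and the ordinal sub-distance follows Equation 8 as the normalized rank difference:

```python
def msm_distance(a, b, schema):
    # Total MSM distance: sum of per-attribute sub-distances, each
    # computed with the measure matching that attribute's type.
    total = 0.0
    for x, y, (kind, *info) in zip(a, b, schema):
        if kind in ("binary", "nominal"):   # Eq. 3: simple matching
            total += 0.0 if x == y else 1.0
        elif kind == "ordinal":             # Eq. 8: rank difference / (M - 1)
            m = info[0]                     # upper limit of the domain
            total += abs(x - y) / (m - 1)
        else:                               # "numerical", Eq. 9: range-normalized
            lo, hi = info
            total += abs(x - y) / (hi - lo)
    return total

# Attribute types as in the Figure 1 example; the values are hypothetical.
schema = [("binary",), ("nominal",), ("ordinal", 4), ("numerical", 150, 175)]
A = (1, "red", 4, 172)
B = (0, "red", 2, 160)
print(msm_distance(A, B, schema))  # 1 + 0 + 2/3 + 12/25 = 2.1466...
```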
3.2 Updating Centers
Generally speaking, the center-update step
differs according to the type of data (e.g., categorical
or numerical). Thus, when updating centers, the
proposed MSM method updates each attribute
value according to its data type (see Figure
2). If the attribute value is numerical, we use
the updating rule of the K-means algorithm;
if the attribute value is categorical, we
use the updating rule of the K-modes algorithm.
Figure 2: Example for updating centers in the MSM
method.
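A minimal sketch of this mixed update rule, with illustrative attribute types and values of our own:

```python
from collections import Counter

def update_mixed_center(cluster_members, types):
    # MSM center update: mean for numerical attributes (K-means rule),
    # mode for categorical attributes (K-modes rule).
    center = []
    for column, kind in zip(zip(*cluster_members), types):
        if kind == "numerical":
            center.append(sum(column) / len(column))
        else:  # binary / nominal / ordinal treated as categorical
            center.append(Counter(column).most_common(1)[0][0])
    return center

members = [(1, "red", 3, 160.0), (0, "red", 3, 170.0), (1, "blue", 2, 165.0)]
types = ("binary", "nominal", "ordinal", "numerical")
print(update_mixed_center(members, types))  # [1, 'red', 3, 165.0]
```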
4 EXPERIMENTAL DESIGN
Measuring similarity between data points is a
cornerstone of the clustering process, whether in a
non-evolutionary clustering setting (e.g., Procedures 1
and 2) or in an evolutionary clustering setting. Thus,
to evaluate the performance of the MSM method, we
compared it against the existing similarity
measures in (Boriah et al., 2008) (i.e., the matching
distance, IOF, and Eskin similarity measures), in
addition to the scaling method of (Parameswari et
al., 2015), in both the non-evolutionary and
evolutionary settings. Evolutionary computation
techniques play a vital role in improving data
clustering performance because of their ability to avoid
getting trapped in locally optimal solutions.
We use differential evolution (DE) as the
evolutionary technique, where a similarity measure
becomes a sub-routine used within the evolutionary
setting. For DE with the MSM method (denoted
DE-MSM), Procedure 3 illustrates the steps of the
algorithm. In step 3, the initial cluster
centers are determined randomly. The following steps
form the main part of the proposed method:
centers are updated first, then
distances. The mutation and crossover operators are
then applied using Equations 5 and 6,
respectively. The resulting new individual is a
candidate that is evaluated against its parent using
Equation 7, keeping the one with the better fitness.
When the maximum number of iterations is reached,
we use the accuracy performance measure (Arbelaitz
et al., 2013) to select the best individual of the final
population.
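The paper does not spell out the accuracy measure itself; a common external-accuracy variant (mapping each cluster to its majority ground-truth class) can be sketched as follows, as an assumption rather than the authors' exact definition:

```python
from collections import Counter

def clustering_accuracy(labels, truth):
    # Map each cluster to its majority ground-truth class and report
    # the fraction of correctly placed points (one common accuracy variant).
    correct = 0
    for j in set(labels):
        members = [truth[i] for i, c in enumerate(labels) if c == j]
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(labels)

print(clustering_accuracy([0, 0, 1, 1, 1], ["a", "a", "b", "b", "a"]))  # 0.8
```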
For DE, we use a population size of 100
individuals (i.e., 100 different sets of centers),
a maximum of 100 iterations, and a
crossover rate CR of 0.2. These parameters were
chosen based on preliminary experiments.
5 EXPERIMENTAL RESULTS
AND DISCUSSIONS
The proposed method is tested on six real-life mixed
datasets obtained from the UCI Machine Learning
Repository (Blake and Merz, 1998). The results of
100 independent runs in the non-evolutionary setting
are summarized in Table 1, which contains the mean
and standard deviation of the best accuracy achieved.
We compare the MSM method against four similarity
measures already existing in the literature (i.e.,
matching distance, IOF, Eskin, and Scaling).
1: Input: D = the used dataset, K = the number of data clusters, NP = the population size.
2: Output: cluster assignments.
3: Add randomly initialized cluster centers (i.e., the individuals of the population).
4: Evaluate the fitness of all individuals.
5: While the stopping criterion (i.e., the maximum number of iterations) is not met, do:
6: For each individual Pi (i = 1 ... NP) in the population, do:
7: a) Update the centers of the k clusters.
8: b) Update the distances between the data objects and the updated cluster centers.
9: c) Apply the mutation operator using Eq. 5.
10: d) Apply the crossover operator using Eq. 6.
11: e) Evaluate the fitness of the offspring C of parent Pi.
12: f) Apply the selection operator to create the new population by comparing the offspring C against its parent Pi using Eq. 7.
13: End For
14: End While
15: Calculate the accuracy performance measure for every individual in the final population.
16: Select the best solution (i.e., the set of centers) with the highest accuracy.
Procedure 3: The DE-MSM method.
Table 1: Mean ± standard deviation of the best solution over 100 independent runs for the simple matching, IOF, Eskin, Scaling and the proposed MSM methods.

| Dataset | Simple Matching | IOF | Eskin | Scaling | MSM | T-test |
|---|---|---|---|---|---|---|
| Breast Cancer | 0.8128434 ± 2.69461E-06 | 0.771992 ± 0.001752535 | 0.782972 ± 0.001451745 | 0.814782 ± 0.0027383 | 0.839089 ± 6.0179E-06 | Significant |
| Zoo | 0.8787367 ± 0.000736404 | 0.861041 ± 0.000184208 | 0.880504 ± 0.00144237 | 0.885224 ± 0.0056389 | 0.913004 ± 0.000432323 | Significant |
| Hepatitis | 0.766462 ± 0.000562314 | 0.710596 ± 0.003786261 | 0.669242 ± 0.00143719 | 0.769892 ± 0.0056282 | 0.8187971 ± 2.72221E-05 | Significant |
| Heart Diseases | 0.7520178 ± 9.35633E-06 | 0.778464 ± 0.001182946 | 0.6315967 ± 0.000205821 | 0.761143 ± 0.00088239 | 0.7953947 ± 1.06071E-05 | Significant |
| Dermatology | 0.8476637 ± 0.00152124 | 0.699989 ± 0.00055469 | 0.6957118 ± 0.000270416 | 0.856321 ± 0.0003345 | 0.8424427 ± 3.90709E-05 | Significant |
| Credit | 0.9043666 ± 4.05246E-06 | 0.864447 ± 0.003066162 | 0.6360959 ± 0.001083251 | 0.91882 ± 0.0004267 | 0.8960072 ± 1.21558E-05 | Significant |
We performed a T-test with a confidence level of 0.05 to
assess the statistical significance of the difference between
the results obtained by the MSM method and those of the
second-best similarity measure.
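Such a comparison could, for example, be run with SciPy's independent two-sample t-test; the accuracy samples below are invented placeholders, not the paper's raw data:

```python
from scipy import stats

# Hypothetical accuracy values from repeated runs of two methods
# (100 values each in practice; shortened here for illustration).
msm_acc    = [0.839, 0.840, 0.838, 0.839]
second_acc = [0.815, 0.814, 0.816, 0.815]

t, p = stats.ttest_ind(msm_acc, second_acc)
print(p < 0.05)  # True -> the difference is significant at the 0.05 level
```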
As shown in Table 1, the MSM method obtained
statistically significantly better results on four
datasets, while the Scaling method obtained better
results on the other two datasets (for one of which the
difference is not statistically significant). Based on
these results, the proposed MSM method performed better
than the other similarity methods, improving the
clustering performance on about 80% of the tested
datasets. Moreover, Table 2 lists the run times of the
five clustering similarity methods on the different
datasets. From Table 2, we can see that the MSM method
generally needs slightly more time than the simple
matching method, but consumes less time than the IOF,
Eskin, and Scaling methods.
Table 2: Average running time (minutes) of the five clustering models on the used datasets.

| Dataset | Simple Matching | IOF | Eskin | Scaling | MSM |
|---|---|---|---|---|---|
| Breast Cancer | 4.82 | 5.33 | 5.47 | 4.97 | 4.94 |
| Zoo | 2.11 | 2.26 | 2.34 | 2.17 | 2.10 |
| Hepatitis | 2.69 | 3.24 | 3.32 | 2.87 | 2.63 |
| Heart Diseases | 3.17 | 3.38 | 3.41 | 3.27 | 3.22 |
| Dermatology | 3.68 | 3.87 | 3.96 | 3.71 | 3.72 |
| Credit | 5.19 | 5.42 | 5.49 | 5.34 | 5.21 |
We now move to the evolutionary clustering
setting, where each similarity measure is used as a
sub-routine to compute distances and update centers
within the DE algorithm. For the same six real-life mixed
datasets, the results of 100 independent
runs are reported in Table 3, which contains the
mean and standard deviation of the best accuracy
achieved. We compared DE with the different
similarity measures (i.e., DE-MSM, DE-Simple
matching, DE-IOF, DE-Eskin, and DE-Scaling). Based
on the experimental results, the DE setting (Table 3)
yields higher accuracy than the non-evolutionary
setting (Table 1). In addition, as shown
in Table 3, DE-MSM obtained statistically
significantly better results on five datasets, while
DE-Scaling obtained better results on one dataset.
6 CONCLUSION AND FUTURE
WORK
In this study, we proposed the novel MSM clustering
method for mixed datasets (i.e., datasets with at
least two types of data attributes). In contrast to
existing approaches in the literature for dealing with mixed
datasets, the MSM method assigns an appropriate
similarity measure to each type of data attribute
(e.g., ordinal, nominal, binary, continuous). When
dealing with a pure dataset (i.e., one with only one type
of data attribute), the MSM method reduces to
the K-means or the K-modes algorithm. Using six
benchmark real-life mixed datasets from the UCI
Machine Learning Repository, we first compared the
performance of the MSM method against other
similarity measures (i.e., simple matching, IOF,
Eskin, and Scaling) in a non-evolutionary setting.
Table 3: Mean ± standard deviation of the best solution over 100 independent runs for DE-simple matching, DE-IOF, DE-Eskin, DE-Scaling, and DE-MSM.

| Dataset | DE-Simple Matching | DE-IOF | DE-Eskin | DE-Scaling | DE-MSM | T-test |
|---|---|---|---|---|---|---|
| Breast Cancer | 0.823201 ± 0.00013254 | 0.7901874 ± 0.000231 | 0.805437 ± 0.006119 | 0.82289 ± 0.000245 | 0.8472614 ± 0.07811E-05 | Significant |
| Zoo | 0.90132 ± 0.0002621 | 0.884791 ± 0.6119E-04 | 0.899645 ± 0.00332 | 0.908892 ± 0.002583 | 0.9435833 ± 2.52812E-06 | Significant |
| Hepatitis | 0.798517 ± 0.003213 | 0.769026 ± 0.00371 | 0.734618 ± 1.842E-04 | 0.797582 ± 0.0007739 | 0.83306326 ± 7.2235E-05 | Significant |
| Heart Diseases | 0.762825 ± 0.000765 | 0.7356806 ± 2.5723E-05 | 0.6571352 ± 0.00422 | 0.774329 ± 0.000113 | 0.82840165 ± 3.77392E-05 | Significant |
| Dermatology | 0.85060403 ± 0.000113 | 0.7285605 ± 0.00117 | 0.705437 ± 0.0005632 | 0.8505721 ± 0.00017 | 0.86351823 ± 1.4426E-04 | Significant |
| Credit | 0.9392598 ± 0.0006234 | 0.88369739 ± 0.000921 | 0.7401278 ± 3.48192E-04 | 0.940456 ± 0.000253 | 0.91358951 ± 0.000218 | Significant |
The experimental results showed that the MSM
method achieved statistically significant accuracy
improvements on 80% of the tested datasets. We then
moved to the evolutionary setting using DE, where the
similarity measures were used to compute distances and
update centers during the search process. DE showed its
ability to improve the clustering performance
compared to the non-evolutionary setting, and DE-MSM
achieved statistically significant accuracy improvements
on 90% of the tested datasets compared to DE-simple
matching, DE-IOF, DE-Eskin, and DE-Scaling. The
time and space complexity of our proposed method
is analyzed, and the comparison with the other
methods confirms the effectiveness of our method.
For future work, the proposed MSM and/or DE-MSM
methods can be used in a multiobjective data
clustering framework to deal specifically with mixed
datasets. Furthermore, the current work can be
extended to data clustering models with uncertainty.
REFERENCES
Ahmad A., Dey L., 2007, A k-mean clustering algorithm for
mixed numeric and categorical data, Data &
Knowledge Engineering, 63, pp. 503–527.
Ammar E. Z., Lingras P., 2012, K-modes clustering using
possibilistic membership, IPMU 2012, Part III, CCIS
299, pp. 596–605.
Aranganayagi S., Thangavel K., 2009, Improved K-
modes for categorical clustering using weighted
dissimilarity measure, International Journal of
Computer, Electrical, Automation, Control and
Information Engineering, 3 (3), pp. 729–735.
Arbelaitz O., Gurrutxaga I., Muguerza J., Rez J. M.,
Perona I., 2013, An extensive comparative study of
cluster validity indices, Pattern Recognition (46), pp.
243–256.
Asadi S., Rao S., Kishore C., Raju Sh., 2012, Clustering
the mixed numerical and categorical datasets using
similarity weight and filter method, International
Journal of Computer Science, Information Technology
and Management, 1 (1-2).
Baghshah M. S., Shouraki S. B., 2009, Semi-supervised
metric learning using pairwise constraints,
Proceedings of the Twenty-First International Joint
Conference on Artificial Intelligence (IJCAI), pp.
1217–1222.
Bai L., Lianga J., Dang Ch., Cao F., 2013, A novel fuzzy
clustering algorithm with between-cluster information
for categorical data, Fuzzy Sets and Systems, 215, pp.
55–73.
Bai L., Liang J., Sui Ch., Dang Ch., 2013, Fast global k-
means clustering based on local geometrical
information, Information Sciences, 245, pp. 168-180.
Bhagat P. M., Halgaonkar P. S., Wadhai V. M., 2013,
Review of clustering algorithm for categorical data,
International Journal of Engineering and Advanced
Technology, 3 (2).
Blake, C., Merz, C., 1998. UCI repository machine
learning datasets.
Boriah Sh., Chandola V., Kumar V., 2008, Similarity
measures for categorical data: A comparative
evaluation. The Eighth SIAM International
Conference on Data Mining. pp. 243–254.
Cha S., 2007, Comprehensive survey on
distance/similarity measures between probability
density functions, International journal of
mathematical models and methods in applied sciences,
1(4), pp. 300–307.
Gibson D., Kleinberg J., Raghavan P., 1998, Clustering
categorical data: An approach based on dynamical
systems, In 24th International Conference on Very
Large Databases, pp. 311–322.
Kim K.K., Hyunchul A., 2008, A recommender system
using GA K-means clustering in an online shopping
market, Elsevier Journal, Expert Systems with
Applications 34, pp. 1200–1209.
Kumar N., Kummamuru K., 2008, Semi-supervised
clustering with metric learning using relative
comparisons, IEEE Transactions on Knowledge and
Data Engineering, 20 (4), pp. 496–503.
Mukhopadhyay A., Maulik U., 2007, Multiobjective
approach to categorical data clustering, IEEE
Congress on Evolutionary Computation, pp. 1296 –
1303.
Mutazinda H., Sowjanya M., Mrudula O., 2015, Cluster
ensemble approach for clustering mixed data,
International Journal of Computer Techniques, 2 (5),
pp. 43–51.
Parameswari P., Abdul Samath J., Saranya S., 2015,
Scalable clustering using rank based preprocessing
technique for mixed data sets using enhanced rock
algorithm, African Journal of Basic & Applied
Sciences, 7 (3), pp. 129–136.
Pinisetty V.N. P., Valaboju R., Rao N. R., 2012, Hybrid
algorithm for clustering mixed data sets, IOSR Journal
of Computer Engineering, 6, pp 9–13.
Saha S., Plewczyński D., Maulik U., Bandyopadhyay S.,
2010, Consensus multiobjective differential crisp
clustering for categorical data analysis, RSCTC, LNAI
6086, pp. 30–39.
Serapião B. S., Corrêa G. S., Gonçalves F. B., Carvalho
V. O., 2016, Combining K-means and K-harmonic
with fish school search algorithm for data clustering
task on graphics processing units, Applied Soft
Computing, 41, pp. 290–304.
Shih M., Jheng J., Lai L., 2010, A two-step method for
clustering mixed categorical and numeric data,
Tamkang Journal of Science and Engineering, 13 (1),
pp. 11–19.
Soundaryadevi M., Jayashree L.S., 2014, Clustering of
data with mixed attributes based on unified similarity
metric, Proceedings of International Conference On
Global Innovations In Computing Technology, pp.
1865–1870.
Suresh K., Kundu D., Ghosh S., Das S., Han Y. S., 2009,
Multi-objective differential evolution for automatic
clustering with application to micro-array data
analysis, Sensors, 9 (5), pp. 3981–4004.
Tasdemir K., Merényi E., 2011, A validity index for
prototype-based clustering of data sets with complex
cluster structures, IEEE transactions on systems, man,
and cybernetics—part b, 41(4), pp. 1039–1053.
Tiwari M., Jha M. B., 2012, Enhancing the performance of
data mining algorithm in letter image recognition data,
International Journal of Computer Applications in
Engineering Sciences, II (III), pp. 217–220.
Wu X., Kumar V., Quinlan J. R., Ghosh J., Yang Q. ,
Motoda H., McLachlan G. J., Ng A., Liu B. , Yu Ph.
S., Zhou Zh., Steinbach M., Hand D. J., Steinberg D.,
2008, Top 10 algorithms in data mining, Knowledge
Information System, 14, pp. 1–37.
Xing E., 2003, Distance metric learning with application
to clustering with side-information, in NIPS, pp. 505–
512.