Semi-supervised Distributed Clustering for Bioinformatics - Comparison Study
Huaying Li and Aleksandar Jeremic
Dept. of Electrical and Computer Engineering, McMaster University, Hamilton, Canada
Keywords:
Information Fusion, Bioinformatics, Distributed Clustering.
Abstract:
Clustering analysis is a widely used technique in bioinformatics and biochemistry for a variety of applications
such as detection of new cell types, evaluation of drug response, etc. Since different applications and cells
may require different clustering algorithms, combining multiple clustering results into a consensus clustering
using distributed clustering is a popular and efficient way to improve the quality of clustering analysis.
Currently existing solutions are commonly based on unsupervised techniques which do not require any a priori
knowledge. However, in certain cases a priori information on particular labelings may be available. In
these cases it is expected that a performance improvement can be achieved by utilizing this prior information.
To this purpose, in this paper we propose two semi-supervised distributed clustering algorithms and evaluate
their performance for different sets of base clusterings.
1 INTRODUCTION
Mutation is an accidental change in the genomic sequence
of DNA (Pickett, 2006) and has often been
used in biochemistry in order to produce or improve
features of different objects such as plants, drugs, etc.
These changes are usually observed (monitored) us-
ing fluorescence microscopy, an important tool for vi-
sualizing biochemical activity within individual cells.
Automated analysis of these images typically involves
acquiring high resolution images and translating them
into a multi-dimensional feature space, which spans
hundreds of features per fluorescence channel and
is further processed to provide relevant output
(Shariff et al., 2010), which is commonly done using
clustering algorithms. Although many
clustering algorithms exist in the literature, no single
algorithm can correctly identify the underlying structure
of all data sets in practice (Xu and Wunsch, 2008).
Combining multiple clusterings into a consensus labeling
is a hard problem for two reasons: (1) the
number of clusters may differ across clusterings and (2) the label
correspondence problem. In (Vega-Pons and Ruiz-Shulcloper,
2011), the authors provide a detailed review
of many existing algorithms: some algorithms
are based on relabeling and voting; some are based
on a co-association matrix. All of these algorithms are
unsupervised because the input data set is unlabeled
and the clusters are not pre-defined. Also, most
cluster ensemble algorithms consist of two major
steps: cluster ensemble generation and consensus
fusion. Unlike the distributed detection problem,
information fusion for cluster analysis is more
difficult for at least the following two reasons:
(1) the number of clusters in each clustering
can be different and the desired number of clusters
is usually unknown, and (2) the cluster labels from
different clusterings are symbolic, and the same symbolic
label from different clusterings sometimes corresponds
to different clusters. Therefore, a correspondence
problem always accompanies the clustering
ensemble problem (Strehl and Ghosh, 2003). The
common way to avoid the correspondence problem
(Dudoit and Fridlyand, 2003; Fred and Jain, 2005)
is to construct a pairwise similarity matrix between
data points.
proposed three algorithms based on hypergraph rep-
resentation of clusterings to solve the ensemble prob-
lem. In the meta-clustering algorithm (MCLA), the
clusters of a local clustering are represented by hyper-
edges. Many other approaches to combine the base
clusterings have been proposed in the literature, such
as relabelling-and-voting based and mixture-density
based approaches.
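To make the co-association idea mentioned above concrete, the sketch below (an illustration only, not the method proposed in this paper) builds the pairwise similarity matrix from a set of base clusterings; the matrix can then be fed to any similarity-based clusterer to obtain a consensus without solving the label correspondence problem. The function name and array layout are assumptions made for the example.

```python
import numpy as np

def co_association(base_labels):
    """Pairwise similarity from D base clusterings.

    base_labels: (D, N) integer array, one row of cluster labels per clustering.
    Returns an (N, N) matrix whose (p, q) entry is the fraction of clusterings
    that place points p and q in the same cluster.
    """
    base_labels = np.asarray(base_labels)
    D, N = base_labels.shape
    C = np.zeros((N, N))
    for labels in base_labels:
        C += (labels[:, None] == labels[None, :])  # 1 wherever the two labels agree
    return C / D
```

Because only co-membership counts enter the matrix, the symbolic labels of the individual clusterings never need to be aligned.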
In this paper we propose two semi-supervised
clustering algorithms, soft and hard decision-making
versions, and compare their performance. For
the soft semi-supervised clustering ensemble algorithm
(SSEA), the average association vector is
computed for each data point and all the average association
vectors are normalized to derive the soft consensus
label matrix for the given data set. For the
hard semi-supervised clustering ensemble algorithm
(HSEA), the hard consensus clustering is generated
by one of two approaches. One approach is to assign each
data point its most associated cluster id based on its
average association vector. This version is named the
soft-to-hard semi-supervised clustering ensemble algorithm
(SHSEA). The other approach is to relabel
the set of base clusterings by assigning each data point
its most associated cluster id according to each base
clustering and to derive the hard consensus clustering
by majority voting. This is the hard-to-hard
semi-supervised clustering ensemble algorithm (HHSEA).
2 DISTRIBUTED CLUSTERING
In the literature, many clustering ensemble algo-
rithms have been proposed and can be broadly di-
vided into different categories, such as relabelling
and voting based, co-association based, hypergraph
based and mixture-densities based clustering ensem-
ble algorithms (Ghaemi et al., 2009), (Vega-Pons
and Ruiz-Shulcloper, 2011), (Aggarwal and Reddy,
2013). Clustering ensemble methods usually consist
of two major steps: base clustering generation and
consensus fusion. The set of base clusterings can be
generated in different ways, which has been discussed
in the previous section. In this section, we provide a
brief review of several consensus fusion methods.
2.1 Semi-supervised Clustering
Ensemble
In this paper we propose a semi-supervised algorithm
that utilizes side information (data observations
with known labels). The algorithm calculates
the association between each data point and the
training clusters (formed by the labelled data observations)
and relabels the cluster labels in Φ_u according
to the training clusters. In the context of this paper,
since the generation of base clusterings is based on
unsupervised clustering algorithms and the fusion of
base clusterings is guided by the side information, we
name the proposed algorithm the semi-supervised
clustering ensemble algorithm (SEA). It consists of
two major steps: base clustering generation and
fusion. The base clustering generation step is common
to the existing ensemble methods and is summarized
in Table 1. For the base clustering fusion step,
we propose different versions of the fusion function
to produce soft and hard consensus clusterings respectively.
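As a rough illustration of the base clustering generation step summarized in Table 1, the sketch below builds D K-means clusterers with varying numbers of clusters using scikit-learn; the specific algorithm, parameter choices and function name are illustrative assumptions, not the settings used in the experiments of this paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def generate_base_clusterings(X, k_values, seed=0):
    """Return the label matrix Phi of shape (N, D); column j is clustering lambda^(j)."""
    rng = np.random.default_rng(seed)
    D = len(k_values)
    Phi = np.empty((X.shape[0], D), dtype=int)
    for j, k in enumerate(k_values):
        # (a) build clusterer phi^(j) with its own parameter settings
        km = KMeans(n_clusters=int(k), n_init=10,
                    random_state=int(rng.integers(1_000_000)))
        # (b) apply it to the data set X to obtain the individual clustering lambda^(j)
        Phi[:, j] = km.fit_predict(X)
    # (c) the D columns together form the set of base clusterings Phi
    return Phi
```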
2.2 Soft Semi-supervised Clustering
Ensemble Algorithm
Suppose the input data set X is the combination of
a training set X_r and a testing set X_u. The training
set X_r contains data points {x_1, ..., x_{N_r}}, for which
labels are provided in a label vector λ_r. The testing
data set X_u contains data points {x_{N_r+1}, ..., x_N}, the
labels of which are unknown. The consensus cluster
label vector (output of SEA) for the test set X_u is
denoted by λ_u. The size of the training set X_r is the
number of data points in the training set and is denoted
by N_r, i.e., |X_r| = N_r. Similarly, the size of the testing
set X_u is the number of data points in the testing set
and is denoted by N_u, i.e., |X_u| = N_u. According to
the training and testing sets, the label matrix Φ can
be partitioned into two block matrices Φ_r and Φ_u,
which contain all the labels corresponding to the data
points in the training set X_r and testing set X_u respectively.
Suppose the training data points belong to K_0 classes
and all training points from the k-th class form one
cluster, denoted by C_r^k (k = 1, ..., K_0). Therefore, the
training set X_r consists of a set of K_0 clusters
{C_r^1, ..., C_r^k, ..., C_r^{K_0}}. If the size of cluster
C_r^k is denoted by N_r^k, the total number of training
points N_r is equal to Σ_{k=1}^{K_0} N_r^k. We rearrange
the label matrix Φ_r to form K_0 block matrices
Φ_r^1, ..., Φ_r^k, ..., Φ_r^{K_0}. Each block matrix Φ_r^k
contains the base cluster labels of the data points in the
k-th training cluster C_r^k, where k = 1, ..., K_0.
For a given set of base clusterings, the soft version
of the semi-supervised clustering algorithm (SSEA)
provides a soft consensus cluster label matrix. The
fusion idea is stated as follows: (1) for a particular
data point, count the number of agreements between
its label and the labels of the training points in each
training cluster, according to an individual base
clustering, (2) calculate the association vector between
this data point and the corresponding base clustering,
(3) compute the average association vector by averaging
the association vectors between this data point and
all base clusterings, and (4) repeat for all data points
and derive the soft consensus clustering for the testing
set. The summary of SSEA is provided in Table 2.
According to the j-th clustering λ^(j), we compute
the association vector a_i^(j) for the i-th unlabelled data
point x_i, where i = 1, ..., N_u and j = 1, ..., D. Since
there are K_0 training clusters, the association vector
a_i^(j) has K_0 entries. Each entry describes the association
between data point x_i and the corresponding training cluster.
Table 1: Base clusterings generation.
* Input: Data set X
* Output: Base clusterings Φ
(a) Select a clustering algorithm and determine its initialization and parameter settings to build clusterer φ^(j)
(b) Apply clusterer φ^(j) to data set X and obtain the individual clustering λ^(j)
(c) Repeat (a) and (b) for j = 1, ..., D to form a set of base clusterings Φ
Table 2: Soft semi-supervised clustering ensemble algorithm (SSEA).
* Input: Base clusterings Φ
* Output: Soft clustering Λ_u
(a) According to the label vector λ_r, rearrange the base clusterings Φ into K_0 + 1 sub-matrices {Φ_r^1, ..., Φ_r^k, ..., Φ_r^{K_0}, Φ_u}
(b) For data point x_i, calculate the k-th element of the association vector a_i^(j) by a_i^(j)(k) = (occurrence of Φ_u(i, j) in Φ_r^k(:, j)) / N_r^k and repeat for k = 1, ..., K_0 to form the association vector a_i^(j)
(c) Compute the average association vector a_i of data point x_i by a_i = (1/D) Σ_{j=1}^{D} a_i^(j)
(d) Compute the association level γ_i of data point x_i to all training clusters by γ_i = Σ_{k=1}^{K_0} a_i(k)
(e) Compute the membership information of data point x_i to every cluster by normalizing a_i
(f) Repeat steps (b) to (d) to generate the association level vector γ_u and repeat steps (b) to (e) to generate the soft clustering Λ_u
The k-th entry of the association vector a_i^(j) is
calculated as the ratio of the occurrence of Φ_u(i, j)
in Φ_r^k(:, j) to the number of data points in the
k-th training cluster (N_r^k), i.e.,

a_i^(j)(k) = (occurrence of Φ_u(i, j) in Φ_r^k(:, j)) / N_r^k,   (1)
where Φ_u(i, j) is the cluster label of data point x_i according
to the j-th base clustering and Φ_r^k(:, j) represents
the labels of all data points in the k-th training
category generated by the j-th local clusterer. For
each data point x_i, different association vectors a_i^(j)
(j = 1, ..., D) are calculated since there are D local
clusterers in the system. In order to fuse the information,
the average association vector a_i for data point
x_i is computed by averaging all the association vectors
a_i^(j), i.e.,

a_i = (1/D) Σ_{j=1}^{D} a_i^(j).   (2)
Each entry of a_i describes the consolidated association
between data point x_i and one of the training
clusters. As a consequence, the summation of all the
entries of a_i can be used to quantitatively describe the
association between data point x_i and all the training
clusters. We define it as the association level of
data point x_i to all the training clusters and denote it
by γ_i, i.e.,

γ_i = Σ_{k=1}^{K_0} a_i(k).   (3)

By computing the association levels for all the data
observations, the association level vector γ_u for the
testing set X_u is formed by stacking the association
levels γ_i for all i = 1, ..., N_u, i.e., γ_u = [γ_1, γ_2, ..., γ_{N_u}]^T.
Table 3: Soft to hard semi-supervised clustering ensemble algorithm (SHSEA).
* Input: Soft clustering Λ_u
* Output: Hard clustering λ_u
(a) Based on the average association vector a_i, assign data point x_i its most associated cluster id, which corresponds to the highest entry in the average association vector
(b) Repeat (a) for all i = 1, ..., N_u
Table 4: Hard to hard semi-supervised clustering ensemble algorithm (HHSEA).
* Input: Base clusterings Φ
* Output: Hard clustering λ_u
(a) According to the label vector λ_r, rearrange the base clusterings Φ into K_0 + 1 sub-matrices {Φ_r^1, ..., Φ_r^k, ..., Φ_r^{K_0}, Φ_u}
(b) For data point x_i, calculate the k-th element of the association vector a_i^(j) by a_i^(j)(k) = (occurrence of Φ_u(i, j) in Φ_r^k(:, j)) / N_r^k and repeat for k = 1, ..., K_0 to form the association vector a_i^(j)
(c) Assign data point x_i its most associated cluster id, which corresponds to the highest entry of the association vector a_i^(j)
(d) According to the j-th clustering, repeat steps (b) and (c) for all data points
(e) Repeat (b)-(d) for j = 1, ..., D to relabel Φ_u
(f) Apply majority voting on the relabelled Φ_u to derive the hard consensus clustering λ_u
Let us denote the soft consensus clustering of the test set X_u
by the label matrix Λ_u. The i-th row of Λ_u is computed
by normalizing the average association vector a_i, i.e.,

Λ_u(i, :) = a_i^T / γ_i.   (4)
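A minimal sketch of the SSEA fusion steps (equations (1)-(4)) is given below, assuming Φ is stored as an (N, D) integer matrix whose first N_r rows correspond to the training points with class labels λ_r in {0, ..., K_0-1}; the function and variable names are illustrative, not part of the paper.

```python
import numpy as np

def ssea(Phi, lambda_r):
    """Soft semi-supervised clustering ensemble (SSEA), eqs. (1)-(4)."""
    Phi = np.asarray(Phi)
    lambda_r = np.asarray(lambda_r)
    N, D = Phi.shape
    N_r = len(lambda_r)
    K0 = len(np.unique(lambda_r))
    Phi_r, Phi_u = Phi[:N_r], Phi[N_r:]
    A = np.zeros((N - N_r, K0))            # will hold the average association vectors a_i
    for j in range(D):
        for k in range(K0):
            col = Phi_r[lambda_r == k, j]  # Phi_r^k(:, j)
            # eq. (1): fraction of class-k training points sharing the test point's label
            A[:, k] += (Phi_u[:, j][:, None] == col[None, :]).mean(axis=1)
    A /= D                                 # eq. (2): average over the D base clusterings
    gamma = A.sum(axis=1)                  # eq. (3): association levels gamma_i
    Lambda_u = A / gamma[:, None]          # eq. (4): soft consensus label matrix
    return Lambda_u, gamma
```

A test point whose labels never co-occur with any training point would have γ_i = 0; such points would need separate handling (e.g., a uniform membership vector), which this sketch does not address.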
2.3 Hard Semi-supervised Clustering
Ensemble Algorithm
In this section, we propose the hard version of the
semi-supervised clustering ensemble algorithm using
two approaches. The first approach is based on
calculating the average association vector a_i for data
point x_i. The consensus cluster label assigned to each
data point is its most associated category label in the
corresponding average association vector. Since the
hard labels are derived from the soft label matrix Λ_u,
it is named the soft-to-hard semi-supervised clustering
ensemble algorithm (SHSEA). The summary of
this algorithm is provided in Table 3.
We also propose to derive the hard consensus clustering
with another approach, called the hard-to-hard
semi-supervised clustering ensemble algorithm (HHSEA).
The fusion idea is stated as follows: (1) for a particular
data point, count the number of agreements between
its label and the labels of the training points in each
training cluster, according to an individual base clustering,
(2) calculate the association vector between
this data point and the corresponding base clustering,
(3) assign this data point its most associated cluster
label, (4) repeat for all data points and all base clusterings
to relabel the labels in the matrix Φ_u, and (5) apply
majority voting to derive the hard consensus clustering.
The summary of this algorithm is provided in Table 4.
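The two hard-decision variants reduce to simple operations on the quantities defined above; the sketch below follows Tables 3 and 4, using illustrative function names and the same data layout as the SSEA sketch (it is an assumption-laden example, not the reference implementation).

```python
import numpy as np

def shsea(Lambda_u):
    """SHSEA (Table 3): most associated training cluster per test point."""
    return np.asarray(Lambda_u).argmax(axis=1)

def hhsea(Phi_u, Phi_r, lambda_r, K0):
    """HHSEA (Table 4): relabel each base clustering, then majority-vote per point."""
    Phi_u, Phi_r = np.asarray(Phi_u), np.asarray(Phi_r)
    lambda_r = np.asarray(lambda_r)
    N_u, D = Phi_u.shape
    relabelled = np.empty((N_u, D), dtype=int)
    for j in range(D):
        # association vectors a_i^(j) with every training cluster (eq. (1))
        a = np.stack([(Phi_u[:, j][:, None] == Phi_r[lambda_r == k, j][None, :]).mean(axis=1)
                      for k in range(K0)], axis=1)
        relabelled[:, j] = a.argmax(axis=1)  # most associated cluster id per point
    # majority voting across the D relabelled base clusterings
    return np.array([np.bincount(row, minlength=K0).argmax() for row in relabelled])
```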
3 NUMERICAL EXAMPLES
In this section, we provide numerical examples
to show the performance of our proposed semi-supervised
clustering ensemble algorithms, SHSEA and HHSEA,
using a real data set of breast cancer cells undergoing
treatment with different drugs.
Table 5: Base clusterings.
Base clustering | Data                 | No. of features | Clustering algorithms | No. of clusters | No. of base clusterings
Base 1          | original             | F               | K-means               | k^(j) > K_0     | M
Base 2          | pre-processed by PCA | F               | K-means/HAC/AP        | k^(j) > K_0     | M
Base 3          | pre-processed by PCA | F_pca           | K-means               | k^(j) > K_0     | M
Base 4          | original             | 1               | K-means               | k^(j) > K_0     | F
Base 5          | original             | 1               | K-means               | k^(j) = K_0     | F
Base 6          | original             | ⌈F/M⌉           | K-means               | k^(j) > K_0     | M
Table 6: Average micro-precisions of SHSEA and HHSEA for different values of p using different sets of base clusterings (column pairs correspond to Base 1 through Base 6 of Table 5).
     | Base 1        | Base 2        | Base 3        | Base 4        | Base 5        | Base 6
p    | SHSEA  HHSEA  | SHSEA  HHSEA  | SHSEA  HHSEA  | SHSEA  HHSEA  | SHSEA  HHSEA  | SHSEA  HHSEA
3%   | 0.6351 0.4928 | 0.6282 0.4856 | 0.6363 0.4932 | 0.6374 0.3044 | 0.6150 0.3389 | 0.6282 0.4460
5%   | 0.6123 0.5170 | 0.6186 0.5150 | 0.6118 0.5162 | 0.6521 0.3838 | 0.6412 0.4570 | 0.6249 0.5139
10%  | 0.6530 0.5852 | 0.6551 0.5914 | 0.6558 0.5849 | 0.6645 0.5268 | 0.6521 0.5787 | 0.6702 0.6077
15%  | 0.6825 0.6269 | 0.6826 0.6324 | 0.6839 0.6277 | 0.7068 0.6072 | 0.7068 0.6974 | 0.6962 0.6455
20%  | 0.6900 0.6443 | 0.6830 0.6352 | 0.6933 0.6473 | 0.7275 0.6664 | 0.7264 0.6720 | 0.6983 0.6635
25%  | 0.7032 0.6579 | 0.7126 0.6636 | 0.7029 0.6578 | 0.7050 0.6659 | 0.6905 0.5879 | 0.7113 0.6848
30%  | 0.6868 0.6554 | 0.6918 0.6663 | 0.6866 0.6580 | 0.7274 0.6934 | 0.7232 0.6089 | 0.6994 0.6811
Table 7: Cancer data set: average micro-precisions of clustering algorithms (K-means, HAC and AP) on the original data set and the data pre-processed by PCA.
Data set      | No. of data points | No. of classes | Data     | Dimensionality | K-means | HAC    | AP     | MCLA
3ClassesTest1 | 542                | 3              | Original | 705            | 0.4469  | 0.4299 | 0.4871 | 0.4989
              |                    |                | PCA      | 100            | 0.4421  | 0.4354 | 0.5277 | 0.4487
Since the expected cluster labels for each data set are available in
the experiments, we use micro-precision as our metric
to measure the accuracy of a clustering result with
respect to the expected labelling. Suppose there are k_t
classes for a given data set X containing N data points,
and N_k is the number of data points in the k-th cluster
that are correctly assigned to the corresponding class.
The corresponding class here is the true class that
has the largest overlap with the k-th cluster. The micro-precision
is defined by mp = Σ_{k=1}^{k_t} N_k / N (Wang et al.,
2011). We arbitrarily construct test files using data
points from different classes by randomly choosing
training data points. According to the value of p,
we randomly select the required number of training
points from their corresponding classes to form the
training file. For each value of p, we create 10 versions
of the training file for each test file and repeat the
experiment 10 times using each version of the training
file. For each value of p, we generate six sets of base
clusterings for each test file (note that the test files refer
to the different classes provided: original breast cancer
cells, cancer cells 24 hours after the drug treatment,
and cancer cells 72 hours after the drug treatment).
Since the dimensionality of the original data set is
quite large (705 features commonly used in biochemistry
software packages), we generate an additional
set of base clusterings using different combinations
of the features instead of using a single feature each
time. The detailed information about how these sets
of base clusterings are generated is provided in Table 5.
Note that K_0 is the number of classes from which
training points are selected, F is the dimensionality of
the feature space, F_pca is the number of principal
components which retain 95% of the total variation of
the original data, and M = 21 is used in the experiments.
⌈·⌉ represents the ceiling function. The micro-precisions
are listed in Table 6, in which the columns correspond
to the base clusterings listed in Table 5.
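As a minimal sketch of the micro-precision metric defined above (with true classes encoded as integers 0, ..., k_t-1), each consensus cluster is matched to the true class with which it overlaps most, and the overlapping points are counted as correct; the function name is illustrative.

```python
import numpy as np

def micro_precision(consensus_labels, true_labels):
    """mp = (sum over clusters of the largest overlap with a true class) / N."""
    consensus_labels = np.asarray(consensus_labels)
    true_labels = np.asarray(true_labels)
    correct = 0
    for c in np.unique(consensus_labels):
        members = true_labels[consensus_labels == c]  # true classes inside cluster c
        correct += np.bincount(members).max()         # N_k: largest overlap for this cluster
    return correct / len(true_labels)
```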
REFERENCES
Aggarwal, C. C. and Reddy, C. K. (2013). Data clustering:
algorithms and applications. CRC Press.
Basu, S., Banerjee, A., and Mooney, R. (2002). Semi-
supervised clustering by seeding. In Proceedings
of the 19th International Conference on Machine Learn-
ing (ICML-2002). Citeseer.
Chapelle, O., Schölkopf, B., Zien, A., et al. (2006). Semi-
supervised learning. MIT Press.
Dudoit, S. and Fridlyand, J. (2003). Bagging to improve the
accuracy of a clustering procedure. Bioinformatics,
19(9):1090–1099.
Fred, A. L. and Jain, A. K. (2005). Combining multiple
clusterings using evidence accumulation. IEEE trans-
actions on pattern analysis and machine intelligence,
27(6):835–850.
Ghaemi, R., Sulaiman, M. N., Ibrahim, H., and Mustapha,
N. (2009). A survey: clustering ensembles techniques.
World Academy of Science, Engineering and Technol-
ogy, 50:636–645.
Liu, Y., Jin, R., and Jain, A. K. (2007). Boostcluster: Boost-
ing clustering by pairwise constraints. In Proceedings
of the 13th ACM SIGKDD international conference
on Knowledge discovery and data mining, pages 450–
459. ACM.
Pickett, J. P. (2006). The American heritage dictionary of
the English language. Houghton Mifflin.
Shariff, A., Kangas, J., Coelho, L. P., Quinn, S., and Mur-
phy, R. F. (2010). Automated image analysis for high-
content screening and analysis. Journal of biomolec-
ular screening, 15(7):726–734.
Strehl, A. and Ghosh, J. (2003). Cluster ensembles—
a knowledge reuse framework for combining multi-
ple partitions. The Journal of Machine Learning Re-
search, 3:583–617.
Vega-Pons, S. and Ruiz-Shulcloper, J. (2011). A survey of
clustering ensemble algorithms. International Jour-
nal of Pattern Recognition and Artificial Intelligence,
25(03):337–372.
Wang, H., Shan, H., and Banerjee, A. (2011). Bayesian
cluster ensembles. Statistical Analysis and Data Min-
ing, 4(1):54–70.
Xu, R. and Wunsch, D. (2008). Clustering, volume 10. John
Wiley & Sons.