A Stream Clustering Algorithm using Information Theoretic
Clustering Evaluation Function
Erhan Gokcay
Software Engineering Department, Atilim University, Incek, Ankara, Turkey
Keywords: Stream Clustering, Data Stream, Cluster Analysis, Information Theory, Distance Function, Clustering Evaluation Function.
Abstract: Existing stream clustering algorithms can be divided roughly into density-based algorithms and hyperspherical distance-based algorithms. Only density-based algorithms can detect nonlinear clusters, and all of them assume that the data stream is an ordered sequence of points. Many algorithms need to receive data in buckets and process them in online and offline iterations with several passes over the data. In this paper we propose a stream clustering algorithm that uses a distance function able to separate highly nonlinear clusters in one pass. The distance function is based on information theoretic measures and is called the Clustering Evaluation Function. The algorithm handles data one point at a time and finds the correct number of clusters even with highly nonlinear clusters. The data points can arrive in any random order, and the number of clusters does not need to be specified in advance. Each point is compared against the already discovered clusters, and clusters are joined or divided using an iteratively updated threshold.
1 INTRODUCTION
The rapid spread of cloud systems and the Internet of Things has started to generate huge amounts of data as streams. Processing stream data requires different approaches, since data mining algorithms that need the whole data set run into memory and speed problems. Data that has already been collected, called Big Data, can also be treated as a stream, since reading and processing Big Data in one step is not possible because of memory constraints. One of the important data mining tasks is clustering the data into several similar groups, and several challenges need to be addressed here. One problem is the unknown number of clusters in the data set, which many clustering algorithms require before processing. Another problem is the arrival order of the data: each data point from any cluster can arrive in any order. The evolving nature of some real-time generated data also increases the difficulty of clustering such streams.
During the analysis no special processing is done for categorical data. One of the basic assumptions in this paper is that the total data set contains groups that can be clustered offline, at least in principle, by a clustering algorithm. The motivation for this assumption is that although the data arrives as a stream one point at a time, the total set should contain clusters; otherwise no algorithm, streaming or not, can cluster the data.
In this paper we propose an algorithm that requires one pass over the previous data to cluster a stream of data where the total number of clusters is neither known nor needed. To separate clusters properly, we need a distance metric between clusters. The Clustering Evaluation Function (CEF) developed by the author (Gokcay, 2002) is chosen as the metric, since CEF can cluster nonlinear data sets. Many distance measures use hyperspherical distances and cannot cluster nonconvex data sets. The algorithm is called CEFStream, and it can analyze data streams containing nonlinearly separated clusters.
The order of the data points is considered random, and the order of arrival should not change the results. Each arriving data point is tested against the clusters created so far. If the distance to a cluster is closer than a threshold, the point is joined to that cluster; otherwise a new cluster is created. After each addition, the cluster distances are measured again to find cluster groups that need to be merged.
The paper is organized as follows. Section 2 reviews previous work. Section 3
introduces the CEFStream clustering algorithm. Section 4 presents simulations of the algorithm using synthetic data. Finally, Section 5 presents future work and Section 6 concludes the paper.
2 RELATED WORK
In this section we briefly review different approaches to clustering data streams.
In the past decades, many stream clustering algorithms have been proposed based on adaptations of offline algorithms, such as BIRCH (Tian, 1997), CluStream (Charu, 2003), DenStream (Feng, 2006), StreamKM++ (Marcel, 2012), etc. These algorithms can generally be summarized in two steps: an online learning step that derives a data abstraction from the input data stream, and an offline clustering step that generates the final result. During the online learning step, the input data stream is reduced to smaller representations so that storing the whole data is not needed. Obviously these representations should preserve the original distributions to be effective. Different algorithms give different names to these abstract representations, such as "micro-clusters" in DenStream or the "coreset" in StreamKM++. After receiving new data, an online step updates the micro-clusters and an offline step uses the abstract data to finalize the clustering. In the offline step, K-Means (James, 1967) is used by BIRCH, CluStream and StreamKM++, while DBSCAN (Martin, 1996) is used by DenStream.
Many stream clustering algorithms thus adopt K-Means or DBSCAN in their offline clustering step, but both methods have disadvantages. K-Means needs to know the number of clusters k and tends to generate similarly sized spherical clusters. DBSCAN and similar methods (Xu, 2016; Lin, 2009; Alazeez, 2017) can detect a suitable number of arbitrarily shaped clusters, but they have parameters that are critical to the result yet difficult to choose.
A good review of stream clustering methods is
given in (Reddy, 2017).
3 STREAM CLUSTERING USING
CEF
Many clustering algorithms use hyperellipsoidal distance functions, resulting in clusters of similar shape. One exception is density-based algorithms (Xu, 2017; Chen, 2016; Khan, 2016; Hassani, 2016), where there is no assumption about the shape of the clusters. The problem with density is how to choose the density value above which points are considered part of a cluster. In this paper the Clustering Evaluation Function (CEF), developed by the author using information theory, is chosen. The derivation is described in (Gokcay, 2002) and will not be repeated here. The distance measured by CEF can differentiate nonlinearly separated clusters, and it is therefore suitable for clustering data with highly nonconvex regions. The CEF for two clusters is given in (1).

\[
\mathrm{CEF}(C_p, C_q) = \frac{1}{N_p N_q} \sum_{i=1}^{N_p} \sum_{j=1}^{N_q} G\big(x_i - x_j,\; 2\sigma^2\big) \tag{1}
\]

In (1), C_p and C_q are clusters of size N_p and N_q respectively, where x_i ∈ C_p and x_j ∈ C_q. The Gaussian kernel G needs a kernel size parameter σ, which should be determined from the scale of the data.
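As a concrete illustration, a minimal Python sketch of (1) follows; the exact kernel normalization and the handling of σ are assumptions here, since the precise form is given in (Gokcay, 2002).

import numpy as np

def gaussian_kernel(diff, var):
    # Gaussian kernel G(diff, var) with variance var = 2*sigma^2, as in (1).
    d = len(diff)
    norm = (2.0 * np.pi * var) ** (-d / 2.0)
    return norm * np.exp(-np.dot(diff, diff) / (2.0 * var))

def cef(cluster_p, cluster_q, sigma):
    # CEF of equation (1): average pairwise kernel value between two clusters.
    total = 0.0
    for xi in cluster_p:
        for xj in cluster_q:
            total += gaussian_kernel(xi - xj, 2.0 * sigma ** 2)
    return total / (len(cluster_p) * len(cluster_q))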
3.1 Data Stream
The algorithm assumes that each data point x_i arrives in a random order, where i ∈ [1..N] and N is the total number of points in the stream. Obviously N is not bounded in a real data stream, but this does not change the assumption about random arrival. Another assumption is that the data has d dimensions and is scaled to [-1 .. 1]. Although this seems like a strong assumption, it is easy to scale the data online using a precalculated scaling factor if the data has known maximum and minimum limits. The time stamp is not important in the analysis.
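For instance, if per-dimension limits are known in advance, the online scaling could be sketched as follows (the names low and high are hypothetical):

import numpy as np

def scale_online(x, low, high):
    # Map a point with known per-dimension limits into [-1, 1].
    return 2.0 * (np.asarray(x) - low) / (high - low) - 1.0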
3.2 Data Structures
During processing we need to create several data structures besides the data itself. For each cluster C_m discovered in the dataset, we have to store the number of points in the cluster, an array holding the CEF distances to the other clusters, and the data points that belong to the cluster. The need to store the data stream will increase the processing time: although there is no requirement to keep all data in memory, disk access will increase the total access and processing time. As a future study, to reduce the storage and processing requirements these points can be replaced with data skeleton centers (Gokcay, 2016).
For each cluster C_m the following information is stored:

X_m[1..N_m]   (the data points of C_m)
M             (the number of clusters discovered so far in the dataset)
N_m           (the number of data points in C_m)
D_m[1..M-1]   (the CEF distances from the current cluster C_m to C_p, where p ∈ [1..M] and p ≠ m)

The combined storage complexity of the data stream is O(N), which is proportional to the number of points in the stream. The extra information that needs to be stored has a complexity of O(M), and the distance arrays have a storage complexity of O(M²), where M is the number of clusters discovered so far. We also assume M ≪ N.
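A minimal sketch of this per-cluster record in Python might look as follows; the field names are ours, not from the paper, and M is kept implicitly as the length of the cluster list.

from dataclasses import dataclass, field

@dataclass
class Cluster:
    points: list = field(default_factory=list)  # X_m[1..N_m], the points of C_m
    dists: dict = field(default_factory=dict)   # D_m: CEF distance to each other cluster

    @property
    def size(self):
        # N_m, the number of points in the cluster
        return len(self.points)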
3.3 CEFStream Algorithm
The CEFStream algorithm uses one pass over the currently stored dataset. There are no separate online and offline parts of the algorithm.
For each newly arrived sample point x, the distance to the already formed clusters is measured once, as in (2).
\[
D(k) = \mathrm{CEF}(x, C_k), \quad k \in [1..M] \tag{2}
\]

The next step is to find the closest cluster, as in (3). The CEF function is inversely proportional to the distance between clusters; therefore we have to maximize the function.

\[
q = \arg\max_k \{\, D(k),\ k \in [1..M] \,\} \tag{3}
\]

The next step is to check whether the distance to the closest cluster exceeds a threshold T. If D(q) > T, the new sample point is added to C_q; otherwise a new cluster is created.
After processing the sample point, the data structure of each cluster needs to be updated. If a new cluster is created, M is increased by one in each cluster data structure, and the size of each CEF distance array is increased by one to include the new cluster. The distances from the new cluster to the other clusters are added to the arrays; since these distances were already obtained in the previous step, there is no need to recalculate them.
When a new sample point is added to an existing cluster, M does not change, but the distances between clusters still have to be updated. Since the distances from the new point to the other clusters are already available, each distance can be updated in a single operation.
Once the new sample is processed, we check whether there are clusters that need to be merged, again using the threshold T. Using the updated distance array in each cluster, we check whether any two clusters are closer to each other than the threshold T, as in (4). The joining operation continues until there are no clusters left to be merged.

\[
\mathrm{CEF}(C_p, C_q) > T, \quad p \neq q \tag{4}
\]
3.4 Threshold Estimation
The threshold T used to differentiate clusters from each other can be estimated by an iterative calculation that averages all distances obtained so far. Since most points are close enough to each other to form a cluster, the average gives a reasonable estimate for deciding whether a point can be considered part of a cluster or not.
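A minimal sketch of such an iterative estimate, assuming T is simply the running mean of every CEF distance computed so far, might be:

class RunningThreshold:
    # Iteratively updated threshold T: the mean of all CEF distances seen so far.
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, dist):
        self.total += dist
        self.count += 1

    @property
    def value(self):
        return self.total / self.count if self.count else 0.0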
3.5 Processing Complexity
The distance calculation for the new point is done only once using the available dataset; therefore the algorithm can be considered a one-pass algorithm, and the complexity is O(N), where N is the total number of points stored so far. When a new cluster is created, updating the existing arrays and adding the distances from the new cluster to the other clusters has a complexity of O(M). When an existing cluster is updated by adding the sample point, all distances have to be recalculated. This calculation needs one iteration over all clusters and must be repeated for every cluster, hence it requires a complexity of O(M²). Although the CEF distance calculation between clusters C_p and C_q requires a complexity of O(N_p · N_q), the complexity can be reduced to O(1) by using the already calculated distances and the distance calculation of the new point x.
Assume that we have a new point x and that the distances D(k) = CEF(x, C_k), k ∈ [1..M], from the point to all clusters have already been calculated during cluster assignment. Assume that the new point belongs to cluster C_q, so all distances from C_q to the other clusters have to be updated. For example, suppose we want to update the distance CEF(C_q, C_k) from cluster q to cluster k. After cluster q is updated by adding x, the new distance is denoted CEF'(C_q, C_k). Cluster C_q now has N_q + 1 points and cluster C_k has N_k points, and the computational complexity of CEF is O(N_q · N_k). The calculation can be simplified by reusing the previously calculated distance iteratively. The iteration is shown in (5).

\[
\mathrm{CEF}'(C_q, C_k) = \frac{N_q N_k \,\mathrm{CEF}(C_q, C_k) + S(k)}{(N_q + 1)\, N_k} \tag{5}
\]
In this calculation, CEF(C_q, C_k) is already available from the previous iteration, and S(k) = Σ_{x_j ∈ C_k} G(x − x_j, 2σ²), which equals N_k · D(k) from (2), is obtained during the one-pass cluster assignment operation. So the complexity of the update is O(1).
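The O(1) update of (5) is essentially a one-liner; in the sketch below, s_k is the kernel sum S(k) carried over from the assignment pass (the argument names are ours):

def cef_incremental(cef_old, n_q, n_k, s_k):
    # Equation (5): CEF between C_q and C_k after one point x joins C_q.
    # s_k = sum over x_j in C_k of G(x - x_j, 2*sigma^2) = N_k * D(k).
    return (n_q * n_k * cef_old + s_k) / ((n_q + 1) * n_k)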
The final complexity is O(N) · O(M²), where N is the number of points in the stream and M is the number of clusters. Although N is not bounded, the O(N) factor means that the algorithm is a one-pass algorithm. We also assume M ≪ N and that M does not change drastically (it will not keep increasing during the flow of the stream), so that in practice O(M²) can be considered constant.
3.6 Algorithm
The CEFStream algorithm is given below. As long as
there is a new sample, the calculation continues as
indicated by the while loop.
____________________________________
Algorithm 1: CEFStream algorithm.
____________________________________
Input:  Data stream X
Output: Cluster data structure

while x ∈ X do
    for each cluster C_k, k ∈ [1..M]
        D[k] = CEF(x, C_k)
        update T
    end
    q = argmax { D[k], k ∈ [1..M] }
    if (D[q] > T)
        join C_q and x
    else
        create a new cluster using x
    end
    for each cluster C_k, k ∈ [1..M]
        for each cluster C_p, p ∈ [1..M], p ≠ k
            update C_k.D(p)
        end
    end
    for each cluster C_k, k ∈ [1..M]
        for each cluster C_p, p ∈ [1..M], p ≠ k
            if C_k.D(p) > T
                join C_k and C_p
            end
        end
    end
end
____________________________________
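To make the whole flow concrete, the following self-contained Python sketch mirrors Algorithm 1. It is an illustration of the structure rather than a tuned implementation: it recomputes pairwise CEF directly instead of using the O(1) update of (5), uses an unnormalized Gaussian kernel, and takes the running mean of all distances as T.

import numpy as np

def cef(cp, cq, sigma):
    # Equation (1) with an (assumed) unnormalized Gaussian kernel.
    diffs = np.asarray(cp)[:, None, :] - np.asarray(cq)[None, :, :]
    sq = np.sum(diffs * diffs, axis=-1)
    return np.exp(-sq / (4.0 * sigma ** 2)).mean()

def cefstream(stream, sigma):
    # One-pass CEFStream sketch: assign each arriving point, then merge.
    clusters, total, count = [], 0.0, 0
    for x in stream:
        x = np.asarray(x, dtype=float)
        d = [cef([x], c, sigma) for c in clusters]
        total += sum(d)
        count += len(d)
        T = total / count if count else 0.0      # iteratively updated threshold
        if d and max(d) > T:
            clusters[int(np.argmax(d))].append(x)  # join the closest cluster
        else:
            clusters.append([x])                   # create a new cluster
        merged = True                              # merge pass until stable
        while merged and len(clusters) > 1:
            merged = False
            for p in range(len(clusters)):
                for q in range(p + 1, len(clusters)):
                    if cef(clusters[p], clusters[q], sigma) > T:
                        clusters[p].extend(clusters.pop(q))
                        merged = True
                        break
                if merged:
                    break
    return clusters

On a small 2-D set like the one in Fig. 1, cefstream(points, sigma=0.2) would be expected to end with two clusters, provided σ matches the scale of the data.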
3.7 Outliers
There is no special processing for outliers as in (Thakran, 2012). Outliers can be detected by testing the number of points in each cluster: when the number falls below a certain threshold, that cluster can be marked as an outlier.
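A sketch of such a size-based filter follows; the minimum size min_pts is an assumed parameter, not from the paper.

def flag_outliers(clusters, min_pts=3):
    # Clusters with fewer than min_pts points are treated as outliers.
    real = [c for c in clusters if len(c) >= min_pts]
    outliers = [c for c in clusters if len(c) < min_pts]
    return real, outliers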
3.8 Evolving Data Streams
No special processing is used to track changing clusters, as the algorithm can follow the clusters as long as the change is not drastic and sudden. However, if a changing cluster starts overlapping with other clusters, the algorithm may combine the overlapping clusters. To overcome this problem, a window can be applied to the data stream to slowly discard old data.
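A minimal sliding-window sketch is shown below; the window length is an assumption, and in a full implementation the evicted points would also have to be removed from their clusters.

from collections import deque

window = deque(maxlen=10000)  # keep only the most recent points

def admit(x):
    # Old points fall out of the window automatically once maxlen is reached.
    window.append(x)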
4 EXPERIMENTS
The algorithm is tested using several synthetic datasets. Each time, the data arrival order is randomly permuted to test whether the clusters depend on the arrival order. The dependency on the order is first tested using the simple dataset given in Fig. 1.
Figure 1: Sample data set.
There are 10 points in the data set, and each point is taken from the set using a random permutation to simulate a stream. The result after each random permutation is the same, with correct clustering as shown in Fig. 2.
Figure 2: Result of CEFStream showing 2 clusters.
The number of clusters generated during the iterations can be seen in Fig. 3. At the start of the computation the number of clusters increases, which is expected since the sample order is random. After several samples are processed, the clusters start to merge and the number stays the same.
Figure 3: Number of Groups vs. Iterations.
The algorithm is run several times, where each time the arrival order is changed using a random permutation, as given in Table 1. The precision is 100% each time, which means that the result matches the original labels of the data.
Table 1: Different arrival orders of 10 sample points using
random permutations.
1 5 9 6 7 8 10 4 2 3
4 9 8 5 3 2 6 7 1 10
4 1 9 2 10 7 8 6 3 5
6 5 2 7 8 9 10 1 3 4
8 4 9 1 10 3 7 5 2 6
8 6 1 7 9 4 2 10 3 5
4.1 Test Results
The CEFStream algorithm is tested using several different datasets, and each resulting cluster is displayed using a different color. The change in the number of groups is also given in Fig. 4. Each set is tested several times using random permutations to change the arrival order of the points.
Figure 4: Test results with clusters and number of groups.
The algorithm is also tested with outliers, where each outlier is assigned to its own cluster, which is easy to eliminate by checking the cluster size. The result is given in Fig. 5.
Figure 5: Outliers detected as different clusters.
4.2 Difficult Datasets
There are cases where points from two different clusters start forming a single cluster during the early phase of the iterations. For example, the dataset given in Fig. 6 occasionally generates incorrect clusters, depending on the order of arrival.
Figure 6: Occasional incorrect clusters.
The original algorithm can merge similar clusters, but what is also needed is to separate a cluster into two clusters if the addition of a new point starts forming a distinct sub-cluster. This extra step comes with additional computational complexity.
After the new point is added to a cluster, that cluster is tested to see whether it needs to be separated into two different clusters. For this operation we use the OptimalNumberofClusters algorithm developed in (Gokcay, 2017) to find the number of clusters in any dataset. The derivation of the algorithm will not be repeated here. The motivation is not to determine the number of clusters but to test the minimum point of the distance plot created by the algorithm against the threshold T, and to separate the clusters if necessary. Assuming that the points arrive one at a time, the running complexity of this test is O(N_q), where q is the current cluster.
5 FUTURE WORK
As this is a position paper, there is work that remains to be completed. The threshold calculation needs to be improved, because in some cases the average may not be enough to detect the boundary between clusters. Another improvement will be an incremental version of the data skeleton algorithm to reduce the storage requirements. The algorithm also needs to be tested on real data sets. Although the algorithm performs well against random arrivals of nonlinearly separated synthetic clusters, the case will be different with more complicated cluster shapes.
6 CONCLUSIONS
In this paper we have developed a one-pass stream clustering algorithm in which the clusters are independent of the arrival order and highly nonconvex cluster distributions pose no problem. The distance measure used in the algorithm can separate nonlinearly separable clusters efficiently. This is not the case with K-Means and its derivatives, since their distance measure is hyper-ellipsoidal. No assumption is needed about the number of clusters, which many algorithms require. Each new sample point is processed once, and a snapshot can be taken from the algorithm at any time, since there are no online and offline iterations.
REFERENCES
Alazeez, A. A., Jassim, S., and Du, H., 2017, EDDS: An Enhanced Density-Based Method for Clustering Data Streams, in 46th International Conference on Parallel Processing Workshops (ICPPW), Bristol, 2017, pp. 103-112.
Charu A. C., et al. 2003, A framework for clustering
evolving data streams, in Proceedings of the 29th
international conference on Very large data bases-
Volume 29. VLDB Endowment, 2003.
Chen, J., He, H., 2016, A fast density-based data stream
clustering algorithm with cluster centers self-
determined for mixed data, In Information Sciences,
Volume 345, 2016, Pages 271-293
Feng, A. et al., 2006, Density-Based Clustering over an
Evolving Data Stream with Noise, in SDM. Vol. 6.
2006.
Gokcay E., Principe J. C., 2002, Information theoretic
clustering, in IEEE Transactions on Pattern Analysis
and Machine Intelligence, vol. 24, no. 2, pp. 158-171,
Feb 2002.
Gokcay, E., Karakaya M., Bostan A., 2016, A new
Skeletonization Algorithm for Data Processing in
Cloud Computing, in UBMK-2016, First International
Conference on Computer Science and Engineering,
Çorlu, Turkey, 20-23 Oct, 2016.
Gokcay, E., Karakaya M., Sengul, G., 2017, Optimal
Number of Clusters, in ISEAIA 2017, Fifth
International Symposium on Engineering, Artificial
Intelligence & Applications, Girne, North Cyprus, 1-3
Nov, 2017.
Hassani, M., Spaus, P., Cuzzocrea, A., and Seidl, T., 2016, I-HASTREAM: Density-Based Hierarchical Clustering of Big Data Streams and Its Application to Big Graph Analytics Tools, in 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Cartagena, 2016, pp. 656-665.
James, M., 1967, Some methods for classification and
analysis of multivariate observations, Proceedings of
the fifth Berkeley symposium on mathematical statistics
and probability. Vol. 1. No. 14. 1967.
Khan, I., Huang, J., Ivanov K., 2016, Incremental density-
based ensemble clustering over evolving data streams,
In Neurocomputing, Volume 191, 2016, Pages 34-43
Lin, J., and Lin, H., 2009, A density-based clustering over evolving heterogeneous data stream, in ISECS International Colloquium on Computing, Communication, Control, and Management, Sanya, 2009, pp. 275-277.
Marcel A.R., et al. 2012, StreamKM++: A clustering
algorithm for data streams, in Journal of Experimental
Algorithmics (JEA) 17 (2012): 2-4.
Martin, E., et al, 1996, A density-based algorithm for
discovering clusters in large spatial databases with
noise, Kdd. Vol. 96. No. 34. 1996.
Reddy, K., and Bindu, C., S., 2017, A review on density-
based clustering algorithms for big data analysis, in
International Conference on I-SMAC (IoT in Social,
Mobile, Analytics and Cloud) (I-SMAC), Palladam,
2017, pp. 123-130.
Thakran, Y., and Toshniwal, D., 2012, Unsupervised outlier detection in streaming data using weighted clustering, in 12th International Conference on Intelligent Systems Design and Applications (ISDA), Kochi, 2012, pp. 947-952.
Tian Z., Ramakrishnan R., and Livny, M., 1997, BIRCH: A
new data clustering algorithm and its applications. In
Data Mining and Knowledge Discovery 1.2 (1997):
141-182.
Xu, B., Shen F., and Zhao, J., 2016, Density Based Self
Organizing Incremental Neural Network for data
stream clustering, in 2016 International Joint
Conference on Neural Networks (IJCNN), Vancouver,
BC, 2016, pp. 2654-2661.
Xu, J., Wang, G., Li, T., Deng, W., Gou, G., 2017, Fat node
leading tree for data stream clustering with density
peaks, In Knowledge-Based Systems, Volume 120,
2017, Pages 99-117