
 
studied in different applications, but most of the attention has been directed towards non-arbitrary shape clustering such as k-means (Mohammadi et al., 2014)(Gaddam et al., 2007). Arbitrary shape clustering methods preserve all samples of each cluster, and to find the closest cluster to a new sample, the distance from the new sample to all cluster members must be calculated. Such an approach is obviously time consuming. The other way to find the closest cluster is to create a boundary for each cluster: if the new sample lies inside the boundary of a cluster, it belongs to that cluster. However, finding the boundary of arbitrary shape clusters, especially in high dimensional problems, is a complex and time consuming process. Moreover, keeping the border of a cluster as a convex hull requires storing a number of faces that grows exponentially with the dimension (Kersting et al., 2010)(Hershberger, 2009).
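For illustration only (not part of the proposed method, and with hypothetical names), the naive assignment strategy described above can be sketched in Python as follows; its cost grows with the total number of stored members:

import numpy as np

def nearest_cluster(sample, clusters):
    # `clusters` is a list of (n_i, d) arrays holding all members of each
    # cluster, so every query costs O(sum_i n_i * d) distance computations.
    best_label, best_dist = None, np.inf
    for label, members in enumerate(clusters):
        dist = np.linalg.norm(members - sample, axis=1).min()
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label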
In this paper, we propose a new approach that fulfils the above requirements: a summarization approach that summarizes arbitrary shape clusters using a Gaussian Mixture Model (GMM). We first find the core objects of each cluster and then use these core objects as the centres of a GMM, so that each cluster is represented by its own GMM. Since the GMM-based representation keeps the statistical information of each cluster, the resulting summary can be used for pattern extraction, pattern matching, and pattern merging. Moreover, the model is able to classify new objects: each new test sample is fed into the GMM of a cluster, and if its membership probability exceeds a threshold, the sample is attached to that cluster.
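As a rough sketch of this idea, assuming the core objects of a cluster have already been found and using scikit-learn's GaussianMixture (the log-likelihood threshold below is a hypothetical value, not one taken from our experiments):

import numpy as np
from sklearn.mixture import GaussianMixture

def summarize_cluster(members, core_objects):
    # One Gaussian component per core object; the component means are
    # initialised at the core objects, as described above.
    gmm = GaussianMixture(n_components=len(core_objects),
                          means_init=np.asarray(core_objects),
                          covariance_type='full',
                          random_state=0)
    gmm.fit(members)
    return gmm

def belongs_to(gmm, sample, log_threshold=-10.0):
    # Attach the new sample to the cluster if its membership (log)
    # probability under the cluster's GMM exceeds the threshold.
    return gmm.score_samples(np.asarray(sample).reshape(1, -1))[0] > log_threshold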
The structure of the paper is as follows: In Section 2, we review related work on arbitrary shape clustering and summarization approaches. In Section 3, we explain the general structure of the proposed summarization algorithm. In Section 4, we discuss the features of the proposed method. In Section 5, we analyse the complexity of the algorithm in more detail. Section 6 presents the experimental results of the proposed algorithm in comparison with well-known summarization algorithms. Finally, the conclusion and future work are presented in Section 7.
2 RELATED WORK 
There are various algorithms available for clustering, which can be categorized into four groups: partition-based, hierarchical, density-based, and grid-based clustering (Han, 2006). K-means is one of the best-known partition-based algorithms; however, representing a cluster by a centre and a radius makes the shape of clusters spherical, which is undesirable in many applications. In hierarchical clustering methods such as Chameleon, data is clustered in a hierarchical form but still with spherical shapes, which is likewise undesirable; moreover, tuning the parameters of methods like Chameleon remains difficult (Karypis et al., 1999). Grid-based methods such as STING (Wang et al., 1997) and CLIQUE (Agrawal et al., 1998) are able to create arbitrary shape clusters, but their major drawback is the complexity of creating an efficient grid: the grid size varies across dimensions, and setting different grid sizes and merging grid cells to find clusters is difficult, which makes these algorithms inaccurate in many cases. In the area of arbitrary shape clustering, density-based methods are the most prominent, with DBSCAN (Ester et al., 1996) and DENCLUE (Hinneburg et al., 1998) being the most famous. In density-based methods, clusters are formed by connecting dense regions, which yields arbitrary shape clusters. With the prevalence of real-time applications, there is growing interest in making these algorithms fast enough for streaming applications (Guha et al., 2003)(Bifet et al., 2009)(Charu et al., 2003).
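As a small illustration of how a density-based method recovers non-spherical clusters (not part of the proposed method; parameter values are illustrative), scikit-learn's DBSCAN can be applied to two interleaving half-moons:

from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Two interleaving half-moons: a classic arbitrary (non-spherical) shape.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

# DBSCAN connects dense regions; eps and min_samples are illustrative values.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))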
Summarization is a solution that eases the complexity of arbitrary shape clustering methods. The naive way to represent an arbitrary shape cluster is to keep all of its members. Obviously, this approach is neither practical nor does it reflect the cluster's properties. In k-means, a simple representation using a centre and a radius summarizes the cluster, but such a summary clearly does not capture how the data are distributed within the cluster.
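For instance, a centre-and-radius summary can be computed as in the following sketch (illustrative only); every point inside the resulting ball is treated identically, regardless of how the members are actually distributed:

import numpy as np

def centre_radius_summary(members):
    # Summarize a cluster by its centroid and the radius of the
    # smallest centred ball containing all members.
    centre = members.mean(axis=0)
    radius = np.linalg.norm(members - centre, axis=1).max()
    return centre, radius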
There are different ways to summarize arbitrary shape clusters (Yang et al., 2011)(Cao et al., 2006)(Chaoji et al., 2011). These algorithms follow the general idea behind arbitrary shape clustering methods: dense regions are detected and summarized using core objects, and a set of suitable features is then kept to describe the dense regions and their connectivity. In (Yang et al., 2011), a grid is created for each cluster and, based on the idea of connecting dense regions, the core (dense) cells together with their connections and related features are kept. In all summarization approaches, these features play a crucial role. In (Yang et al., 2011), the location, the range of values, and a connection status vector are kept; however, it has some