
 
Another approach (Beeferman and Berger, 2000) 
uses click-through data, forming 
a bipartite graph of queries and documents. 
However, it does not take into account the content 
features of either the query or the document, leading to 
ineffective clustering. 
So far, several Web search results clustering 
systems have been implemented. We can mention 
four of them. Firstly, (Cutting et al., 1992) have 
created the Scatter/Gather system to cluster Web 
search results. This system is based on two 
clustering algorithms: Buckshot – fast for online 
clustering and Fractionation – accurate for offline 
initial clustering of the entire set.  This system has 
some limitations due to the shortcomings of the 
traditional heuristic clustering algorithms (e.g. k-
means) they used.  Secondly, (Zamir and Etzioni, 
1998) proposed an algorithm named Suffix Tree 
Clustering (STC) to automatically group Web search 
results. STC operates on query results snippets and 
clusters together documents with large common 
subphrases. The algorithm first generates a suffix 
tree where each internal node corresponds to a 
phrase, and then clusters are formed by grouping the 
Web search results that contain the same “key” 
phrase. Afterwards, highly overlapping clusters are 
merged. Thirdly, (Stefanowski and Weiss, 2003) 
developed Carrot², an open source search results 
clustering engine. Carrot² can automatically 
organize documents (e.g. search results) into 
thematic categories. Apart from two specialized 
document clustering algorithms (Lingo and STC), 
Carrot² provides integrated components for fetching 
search results from various sources including 
YahooAPI, GoogleAPI, MSN Live API, eTools 
Meta Search, Lucene, SOLR, Google Desktop and 
more. Finally, (Zhang and Dong, 2004) proposed a 
semantic, hierarchical, online clustering approach 
named SHOC, in order to automatically group Web 
search results. Their work is an extension of O. 
Zamir and O. Etzioni's work. By combining the 
power of two novel techniques, key phrase 
discovery and orthogonal clustering, SHOC can 
generate suggestive clusters. Moreover, SHOC can 
work for multiple languages: English and oriental 
languages like Chinese. 
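To make the phrase-based grouping behind STC more concrete, the following Python fragment is a simplified sketch and not the algorithm of (Zamir and Etzioni, 1998) itself. It assumes word n-grams as a stand-in for the suffix tree, and a greedy single-pass merge with a hypothetical overlap threshold of 0.5. The function names `base_clusters` and `merge_overlapping` are chosen for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def base_clusters(snippets: List[str], min_words: int = 2,
                  max_words: int = 3) -> Dict[Tuple[str, ...], Set[int]]:
    """Map each shared phrase (word n-gram) to the snippets containing it."""
    index: Dict[Tuple[str, ...], Set[int]] = defaultdict(set)
    for doc_id, text in enumerate(snippets):
        words = text.lower().split()
        for n in range(min_words, max_words + 1):
            for start in range(len(words) - n + 1):
                index[tuple(words[start:start + n])].add(doc_id)
    # Keep only phrases shared by at least two snippets.
    return {p: docs for p, docs in index.items() if len(docs) >= 2}


def merge_overlapping(clusters: List[Set[int]],
                      threshold: float = 0.5) -> List[Set[int]]:
    """Greedily merge clusters whose mutual overlap exceeds the threshold."""
    merged: List[Set[int]] = []
    for docs in clusters:
        for existing in merged:
            shared = len(docs & existing)
            if (shared / len(docs) > threshold
                    and shared / len(existing) > threshold):
                existing |= docs
                break
        else:
            # No sufficiently overlapping cluster found: start a new one.
            merged.append(set(docs))
    return merged


# Toy snippets, invented for this example.
snippets = [
    "jaguar car dealer prices",
    "new jaguar car review",
    "jaguar big cat habitat",
    "big cat rescue center",
]
base = base_clusters(snippets)
groups = merge_overlapping(list(base.values()))
```

In this toy input, the snippets are grouped by the phrases they share, while the single word "jaguar" alone does not create a cluster because only phrases of at least two words are considered.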
3 THEORETICAL FOUNDATION 
Clustering is one of the most useful tasks in the data 
mining process for discovering groups and 
identifying interesting distributions and patterns in 
the underlying data. The clustering problem is about 
partitioning a given data set into groups (clusters), so 
that the data points in a cluster are more similar to 
each other than points in other clusters. The 
relationship between objects is represented in a 
Proximity Matrix (PM), in which rows and columns 
correspond to objects. This idea is applicable in 
many fields, such as life sciences, medical sciences, 
engineering or e-learning.  
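As a minimal sketch of the Proximity Matrix idea, the Python fragment below fills a symmetric matrix from an arbitrary pairwise distance function. The name `proximity_matrix` and the toy absolute-difference distance are assumptions made for illustration; any distance, including a compression-based one, could be plugged in.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def proximity_matrix(objects: Sequence[T],
                     distance: Callable[[T, T], float]) -> List[List[float]]:
    """Build a symmetric matrix whose rows and columns correspond to objects."""
    n = len(objects)
    pm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = distance(objects[i], objects[j])
            # The distance is assumed symmetric, so both cells are filled.
            pm[i][j] = d
            pm[j][i] = d
    return pm


# Toy one-dimensional data points, invented for this example.
points = [1.0, 2.0, 10.0]
pm = proximity_matrix(points, lambda a, b: abs(a - b))
```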
3.1  Clustering by Compression 
In 2004, Rudi Cilibrasi and Paul Vitanyi proposed a 
new method for clustering based on compression 
algorithms (Cilibrasi and Vitanyi, 2005). The 
method works as follows. First, it determines a 
parameter-free, universal, similarity distance, the 
normalized compression distance or NCD, computed 
from the lengths of compressed data files (singly and 
in pair-wise concatenation). Second, it applies a 
clustering method. 
The method is based on the fact that compression 
algorithms offer a good evaluation of the actual 
quantity of information comprised in the data to be 
clustered, without requiring any previous processing. 
The definition of the normalized compression 
distance is the following: if x and y are the two 
objects concerned, C(x) and C(y) are the lengths 
of the compressed versions of x and y using 
compressor C, and C(xy) is the compressed length of 
their concatenation, then the NCD is defined as: 
$$ \mathrm{NCD}(x,y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}} \tag{1} $$
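The NCD can be computed with any off-the-shelf compressor. The sketch below assumes Python's standard `zlib` module as the compressor C; this choice, the function names `compressed_length` and `ncd`, and the sample byte strings are illustrative assumptions, not the setup used in the original work.

```python
import zlib


def compressed_length(data: bytes) -> int:
    """C(.): length in bytes of the zlib-compressed input."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between x and y."""
    cx = compressed_length(x)
    cy = compressed_length(y)
    cxy = compressed_length(x + y)  # C(xy): compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


# Toy inputs, invented for this example.
a = b"clustering web search results by compression " * 40
c = bytes(range(256)) * 8

d_same = ncd(a, a)   # an object compared with itself
d_diff = ncd(a, c)   # two unrelated objects
```

With a real compressor the distance of an object to itself is small but generally not exactly zero, and unrelated objects give values close to 1.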
The most important advantage of the NCD over 
classic distance metrics is its ability to cluster a large 
number of data samples, due to the high 
performance of the compression algorithms. 
The NCD is not restricted to a specific 
application. To extract a hierarchy of clusters from 
the distance matrix, a dendrogram (ternary tree) is 
determined by using a clustering algorithm. 
Evidence of successful application has been reported 
in areas such as genomics, virology, languages, 
literature, music, handwritten digits, and astronomy. 
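To illustrate how a hierarchy can be read off a distance matrix, the following sketch uses plain average-linkage agglomerative clustering. This is a swapped-in, simpler technique chosen for brevity: it produces a binary merge history and is not the ternary-tree construction referred to above. The function name `average_linkage` and the toy matrix are assumptions.

```python
from typing import List, Tuple


def average_linkage(pm: List[List[float]]) -> List[Tuple[List[int], List[int], float]]:
    """Return the merge history as (cluster_a, cluster_b, distance) per step."""
    clusters = [[i] for i in range(len(pm))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between members of the two clusters.
                total = sum(pm[i][j] for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        # Delete b first (b > a) so that index a remains valid.
        del clusters[b]
        del clusters[a]
        clusters.append(merged)
    return merges


# Toy distance matrix with two obvious groups: {0, 1} and {2, 3}.
pm = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
history = average_linkage(pm)
```

Each entry of `history` records one merge, so reading the list from first to last reconstructs the dendrogram from the leaves to the root.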
The quality of the NCD depends on the 
performance of the compression algorithm used. 
3.2  Classification Methods 
The goal of the classification methods is to group 
elements sharing the same information. 
Classification methods can be divided into the 
following three categories: distance methods, 
character methods and quadruplet methods. 
In order to achieve the objective of this work, 
only the methods of distance and the methods of 
ICSOFT 2010 - 5th International Conference on Software and Data Technologies