
the input data and to develop an accurate model for 
each class using the features presented in the data. 
The class descriptions are used to classify future test 
data for which the class labels are unknown. Web 
document classification is an attempt to merge the 
quality and user-friendliness of directories with the 
popular ranked list presentation. Classifying the 
results increases their readability by showing 
thematic groups instead of a mixture of documents 
on all possible subjects matching the query.
 
Web Mining is an important field that aims to 
make good use of the information available on the 
web and find the data that was either previously 
unknown or hidden. An important step in the mining 
process is information retrieval and extraction. The 
retrieval and extraction methods differ in which 
aspect of a document they use to extract 
information (Lan and Bing, 2003). In general there 
are two schools of thought: natural language 
processing techniques and techniques that use the 
structure of the web. Natural language processing 
techniques operate on the textual data of the web 
through string manipulation. Structural methods build a 
representation from the structure of the document itself. 
Research in web mining also draws on 
research in other fields such as natural language 
processing, artificial intelligence and machine 
learning. The techniques developed in these fields 
mostly deal with a subset of web pages. Efforts 
to combine the content and structure of a web page 
to build a model that is suitable for mining a wide 
variety of web documents are few and certainly 
insufficient. 
2.1 Centroid Technique 
In the centroid-based classification algorithm, Web 
documents are represented using the vector-space 
model (Salton, 1989) (Raghavan and Wong, 1986). 
In this model, each Web document is represented by 
its term-frequency vector, as given in equation 1. 
 
 
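$$\vec{W}_{tf} = (tf_1, tf_2, \ldots, tf_n) \qquad (1)$$

where $\vec{W}_{tf}$ is the Web document vector and $tf_i$ is the frequency of the $i$th term.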
 
 
A widely used refinement to this model is to 
weigh each term based on its inverse document 
frequency (IDF) in the Web document collection 
(Salton, Wong, and Yang, 1975). The motivation 
behind this weighting is that terms appearing 
frequently in many Web documents have limited 
discrimination power, and for this reason they need 
to be de-emphasized. This is commonly done by 
multiplying the frequency of each term i by 
$\log(N/df_i)$. This leads to the tf-idf representation of 
the Web document given in equation 2. 
 
 
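$$\vec{W}_{tfidf} = \left(tf_1 \log\frac{N}{df_1},\; tf_2 \log\frac{N}{df_2},\; \ldots,\; tf_n \log\frac{N}{df_n}\right) \qquad (2)$$

where $df_i$ is the number of documents that contain the $i$th term and $N$ is the total number of documents in the training set.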
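
Equations 1 and 2 can be illustrated with a minimal Python sketch. It assumes documents are already tokenized into lists of terms and uses the natural logarithm; the function names `tf_vector` and `tfidf_vectors` and the sample documents are hypothetical and do not come from the original system.

```python
import math
from collections import Counter


def tf_vector(tokens, vocabulary):
    """Equation 1: raw frequency of each vocabulary term in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def tfidf_vectors(documents):
    """Equation 2: weight each term frequency tf_i by log(N / df_i)."""
    # Assumption: the vocabulary is every distinct term in the training set.
    vocabulary = sorted({term for doc in documents for term in doc})
    n_docs = len(documents)  # N, total number of training documents
    # df_i: number of documents that contain the i-th term
    df = [sum(1 for doc in documents if term in doc) for term in vocabulary]
    vectors = []
    for doc in documents:
        tf = tf_vector(doc, vocabulary)
        vectors.append(
            [tf_i * math.log(n_docs / df_i) for tf_i, df_i in zip(tf, df)]
        )
    return vocabulary, vectors


# Illustrative toy collection (not taken from the paper).
docs = [["web", "mining", "web"], ["web", "page"], ["data", "mining"]]
vocab, vecs = tfidf_vectors(docs)
```

In this toy collection the term "web" occurs in two of three documents, so its weight is scaled by log(3/2), whereas a term found in a single document is scaled by the larger factor log(3).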
 
 
 
In order to account for documents of different 
lengths, the length of each Web document vector is 
normalized so that it is of unit length. Given a set N 
of Web documents and their corresponding vector 
representations, the centroid vector (Han and 
Karypis, 2000) is defined by equation 3.
  
 
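$$\vec{C} = \frac{1}{|N|} \sum_{\vec{W}_{tfidf} \in N} \vec{W}_{tfidf} \qquad (3)$$

where $|N|$ is the number of Web documents in the set N.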
 
Equation 3 is nothing more than the vector 
obtained by averaging the weights of the various 
terms present in the Web documents of N. N is referred 
to as the supporting set for the centroid. In the vector-space 
model, the similarity between two Web 
documents $\vec{W}_i$ and $\vec{W}_j$ is commonly measured using 
the cosine function, given by equation 4. 
 
 
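$$\cos(\vec{W}_i, \vec{W}_j) = \frac{\vec{W}_i \cdot \vec{W}_j}{\|\vec{W}_i\|_2 \, \|\vec{W}_j\|_2} \qquad (4)$$

where "·" denotes the dot-product of the two vectors.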
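
The length normalization, the centroid of equation 3 and the cosine measure of equation 4 can be sketched as follows. This is only an illustration on dense Python lists; the helper names `normalize`, `centroid` and `cosine` are assumptions made for this example, and returning 0.0 for a zero vector is a convention chosen here, not one stated in the paper.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def centroid(vectors):
    """Equation 3: component-wise average of the vectors in the supporting set."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine(w_i, w_j):
    """Equation 4: dot product divided by the product of the 2-norms."""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    norm_i = math.sqrt(sum(a * a for a in w_i))
    norm_j = math.sqrt(sum(b * b for b in w_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # assumed convention for empty documents
    return dot / (norm_i * norm_j)


# Illustrative vectors (not taken from the paper).
supporting_set = [normalize([3.0, 4.0]), normalize([1.0, 0.0])]
c = centroid(supporting_set)
similarity = cosine([1.0, 2.0], [2.0, 4.0])
```

Because cosine similarity depends only on direction, the two parallel vectors in the last line have similarity 1 up to rounding, regardless of their lengths.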
 
 
The advantage of the summarization performed 
by the centroid vectors is that the computational 
complexity of the learning phase of this centroid-based 
classifier is linear in the number of Web 
documents and the number of terms in the training 
set. Moreover, the amount of time required to 
classify a new Web document x is at most O(km), 
where k is the number of centroids and m is the 
number of terms present in x. 
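The cost argument above can be made concrete with a small sketch of a centroid-based classifier. It assumes that documents are stored as sparse dictionaries mapping terms to tf-idf weights and that each centroid is rescaled to unit length, so that a dot product equals the cosine of equation 4; the names `unit`, `train_centroids` and `classify`, as well as the sample data, are hypothetical.

```python
import math
from collections import defaultdict


def unit(vec):
    """Scale a sparse vector (dict term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def train_centroids(labeled_docs):
    """One pass over the training set: average the unit vectors of each class."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for label, vec in labeled_docs:
        counts[label] += 1
        for term, weight in unit(vec).items():
            sums[label][term] += weight
    # Assumption: centroids are rescaled to unit length after averaging.
    return {
        label: unit({t: w / counts[label] for t, w in terms.items()})
        for label, terms in sums.items()
    }


def classify(x, centroids):
    """Return the label of the most similar centroid.

    Each of the k centroids is probed only at the m terms present in x,
    which gives the O(km) bound.
    """
    x = unit(x)
    scores = {
        label: sum(w * c.get(t, 0.0) for t, w in x.items())
        for label, c in centroids.items()
    }
    return max(scores, key=scores.get)


# Illustrative training data (not taken from the paper).
training = [
    ("sports", {"ball": 2.0, "goal": 1.0}),
    ("sports", {"goal": 3.0}),
    ("tech", {"web": 2.0, "page": 1.0}),
    ("tech", {"web": 1.0, "data": 1.0}),
]
centroids = train_centroids(training)
label = classify({"web": 3.0, "page": 1.0}, centroids)
```

Training touches every term of every document once, which matches the linear learning cost stated above.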
2.2 Web Document Indexing 
In order to reduce the complexity of the Web 
documents and make them easier to handle, they 
have to be transformed into vectors. The vector 
space model procedure can be divided into three 
steps. The first step is content extraction, where 
content-bearing terms are extracted from each Web 
page. The second step is term weighting, which enhances 
retrieval of Web documents relevant to the user. The 
last step ranks the Web documents with respect to the 
query according to a similarity measure. 
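The three steps can be sketched end to end as follows. This is a simplified illustration, not the indexing procedure of the original system: the stop-word list, the regular-expression tokenizer and the function names `extract_terms`, `weight_terms`, `cosine` and `rank` are all assumptions, and the extractor keeps every text node of the page, including any script or style content.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Illustrative stop-word list (an assumption, not from the paper).
STOP_WORDS = {"the", "a", "of", "and", "to", "is", "in"}


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML page."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_terms(html):
    """Step 1: content extraction, keeping lower-cased non-stop-word terms."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    text = " ".join(parser.chunks).lower()
    return [t for t in re.findall(r"[a-z]+", text) if t not in STOP_WORDS]


def weight_terms(docs_terms):
    """Step 2: term weighting with tf * log(N / df), stored sparsely."""
    n = len(docs_terms)
    df = Counter(term for terms in docs_terms for term in set(terms))
    idf = {t: math.log(n / d) for t, d in df.items()}
    vectors = [
        {t: tf * idf[t] for t, tf in Counter(terms).items()}
        for terms in docs_terms
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, pages):
    """Step 3: order page indices by decreasing similarity to the query."""
    vectors, idf = weight_terms([extract_terms(p) for p in pages])
    q = {t: tf * idf.get(t, 0.0)
         for t, tf in Counter(extract_terms(query)).items()}
    scores = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return [i for _, i in sorted(scores, key=lambda s: -s[0])]


# Illustrative pages (not taken from the paper).
pages = [
    "<html><body><h1>Web mining</h1><p>Mining the web graph</p></body></html>",
    "<p>Cooking pasta and rice</p>",
    "<p>Web page classification</p>",
]
order = rank("web mining", pages)
```

Query terms that never occur in the collection receive a weight of zero here, so they do not influence the ranking.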
WEBIST 2005 - WEB INTERFACES AND APPLICATIONS