WEB USAGE MINING USING ROUGH AGGLOMERATIVE CLUSTERING

Pradeep Kumar, P. Radha Krishna

Institute for Development and Research in Banking Technology, (IDRBT),

1, Castle hills, Masab Tank, Hyderabad - 500057

Supriya Kumar De

XLRI Jamshedpur, C.H. Area (E), Jamshedpur, INDIA

S Bapi Raju

University of Hyderabad, Gachibowli, Hyderabad, INDIA

Keywords: Data mining, rough sets, clickstream, web usage mining, similarity upper approximation.

Abstract: The tremendous growth of the web has motivated the application of data mining techniques to web logs. Data mining on the World Wide Web is an important and active area of research. Web log mining is the analysis of web log files that record sequences of web page requests. Web mining is broadly classified into web content mining, web usage mining and web structure mining. Web usage mining is a technique for discovering usage patterns from web data in order to understand and better serve the needs of web-based applications. This paper demonstrates a rough set based similarity upper approximation method for clustering web usage patterns. Results on clickstream data are presented to illustrate our technique.

1 INTRODUCTION

The World Wide Web (WWW) is an unstructured collection of pages and hyperlinks. People with different backgrounds and interests access and provide web pages. The application of data mining approaches to the World Wide Web is referred to as web mining. Web mining has attracted many researchers because of the huge amount of active data available on the World Wide Web. Broadly, web mining tasks include web usage mining, web content mining and web structure mining.

Web content mining is the process of discovering information from millions of sources across the World Wide Web. User interactions on the web are recorded in web logs. As each user interaction corresponds to a mouse click, the record is often referred to as a clickstream. Web usage mining is mining performed on web usage data or web logs. Extracting patterns from on-line information, such as HTML files or e-mails, is referred to as web content mining. Web content mining goes beyond basic information retrieval technology. Web structure mining is a research field focused on the analysis of the link structure of the web, and one of its purposes is to identify preferable documents. The intuition is that a hyperlink from document A to document B implies that the author of document A thinks document B contains worthwhile information.

As in conventional data mining, clustering, association and sequential analysis are three important operations in web mining. This paper focuses on clustering, which is an unsupervised learning method to partition a set of patterns into


kumar P., Radha Krishna P., kumar De S. and Bapi Raju S. (2005).

WEB USAGE MINING USING ROUGH AGGLOMERATIVE CLUSTERING.

In Proceedings of the Seventh International Conference on Enterprise Information Systems, pages 315-320

DOI: 10.5220/0002553003150320

 SciTePress

groups (Bezdek, J., 1981). To show the viability of

our approach we applied upper similarity

approximation to cluster clickstream transactions.

In this paper, we present an agglomerative clustering approach using the similarity upper approximation for mining clickstream data. A clickstream is the sequence of URLs browsed by a user within a particular website in one session. Groups of users with similar interest and motivation for visiting a particular website can be discovered by clustering the users' clickstreams on that website. A user session is the clickstream of page views for a single user on the website. We consider each user session as a clickstream transaction, which contains the sequence of URLs (or hyperlinks) followed by a visitor to a web site.

A lot of research has been done in the area of web usage mining (Cooley, R., 2000; Spiliopoulou, 1999; Manco, G. et al., 2003), which directly or indirectly addresses the issues involved in the extraction of web navigational patterns (Spiliopoulou, M. and Faulstich, L. C., 1999), ordering relationships (Mannila, H. and Meek, C., 2000), prediction of web surfing behavior (Pitkow, J. and Pirolli, P., 1999), and clustering of web usage sessions (Fu et al., 2000) based on web logs, possibly supplemented by web content or structure information. Perkowitz and Etzioni (Perkowitz and Etzioni, 2000) proposed the idea of optimizing the structure of web sites based on co-occurrence patterns of pages within usage data for the site. Spiliopoulou and Cooley (Spiliopoulou, 1999; Cooley, R., 2000) have applied data mining techniques to extract usage patterns from web logs for the purpose of deriving market intelligence.

Well-developed mining techniques cannot be applied directly to web data because web logs are unstructured in nature. Clustering in web mining faces several additional challenges (Joshi, A. and Krishnapuram, R., 1998). The specific problem of web usage clustering has been studied over the past few years. In (De and Radha Krishna, 2002), automatic personalization of a web site from user transactions using fuzzy proximity relations is presented. In (De and Radha Krishna, 2004), a clustering algorithm is presented that uses rough approximation to cluster web transactions from web access logs. Web clusters tend to have fuzzy boundaries, and an object is likely to be a candidate for more than one cluster. To deal with the special challenges found in web usage data, a non-conventional clustering approach using rough set theory has been presented in (Hogo, M. et al., 2004). Pawan Lingras (Lingras, P., 2003) has used rough set theory for web mining clustering.

The rest of the paper is organized as follows: Section 2 describes the basics of rough set theory. In Section 3, we present an approach for grouping clickstreams using the similarity upper approximation. Experimental results are presented in Section 4 and we conclude in Section 5.

2 ROUGH SET THEORY

Zdzisław Pawlak introduced rough set theory (Pawlak, 1982) to deal with uncertainty and vagueness. Rough set theory became popular among scientists around the world due to its fundamental importance in the fields of artificial intelligence and cognitive sciences. This section provides a brief summary of the concepts of rough set theory. The building block of rough set theory is the assumption that with every object of the universe of discourse we associate some information in the form of data and knowledge.

Let $U$ denote a universe and let $R \subseteq U \times U$ be an equivalence relation on $U$. The pair $A = (U, R)$ is called an approximation space. The equivalence relation $R$ partitions the set $U$ into disjoint subsets. Such a partition of the universe is denoted by $U/R = (E_1, \ldots, E_n)$, where each $E_i$ is an equivalence class of $R$. If two elements $u, v \in U$ belong to the same equivalence class $E \in U/R$, we say that $u$ and $v$ are indistinguishable. The equivalence classes of $R$ are called the elementary or atomic sets in the approximation space $A = (U, R)$.

Within the same equivalence class it is not possible to differentiate the elements. Hence, one may not get a precise representation for an arbitrary set $X \subseteq U$ in terms of elementary sets in $A$. Rather, the set $X$ may be represented by its lower and upper bounds. The lower approximation $\underline{A}(X)$ is the union of all the elementary sets that are subsets of $X$:

$$\underline{A}(X) = \{ x \in U : [x]_R \subseteq X \}$$

The upper approximation $\overline{A}(X)$ is the union of all the elementary sets that have a non-empty intersection with $X$:

$$\overline{A}(X) = \{ x \in U : [x]_R \cap X \neq \emptyset \}$$

Here $[x]_R$ denotes the equivalence class of $R$ that contains $x$. The pair $(\underline{A}(X), \overline{A}(X))$ is the representation of an ordinary set $X$ in the approximation space $A = (U, R)$, or simply the rough set of $X$. Fig. 1 illustrates the rough set approximation.
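The two approximations can be illustrated with a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names and the six-element universe are hypothetical, and the partition is passed in directly rather than derived from a relation.

```python
from typing import Hashable, Iterable, Set


def lower_approximation(partition: Iterable[Set[Hashable]], x: Set[Hashable]) -> Set[Hashable]:
    """Union of the equivalence classes that are entirely contained in x."""
    result: Set[Hashable] = set()
    for block in partition:
        if block <= x:          # block is a subset of x
            result |= block
    return result


def upper_approximation(partition: Iterable[Set[Hashable]], x: Set[Hashable]) -> Set[Hashable]:
    """Union of the equivalence classes that share at least one element with x."""
    result: Set[Hashable] = set()
    for block in partition:
        if block & x:           # non-empty intersection with x
            result |= block
    return result


# Hypothetical universe {1, ..., 6} split into three equivalence classes.
partition = [{1, 2}, {3, 4}, {5, 6}]
target = {1, 2, 3}

lower = lower_approximation(partition, target)  # {1, 2}
upper = upper_approximation(partition, target)  # {1, 2, 3, 4}
```

The target set {1, 2, 3} cuts through the class {3, 4}, so it cannot be described exactly by elementary sets; it lies between its lower and upper approximations.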

ICEIS 2005 - ARTIFICIAL INTELLIGENCE AND DECISION SUPPORT SYSTEMS


[Figure 1: Rough set approximation. Labels: the set, lower approximation of the set, upper approximation of the set, equivalence class.]

3 CLUSTERING USING ROUGH SETS

In this section, we present an agglomerative approach for clustering clickstream transactions using the similarity upper approximation. In rough set theory, the lower approximation of a concept consists of all objects that definitely belong to the concept. The upper approximation of the concept consists of all objects that possibly belong to the concept. In our approach, we use the upper approximation property to form clusters by defining a similarity upper approximation.

We compare transactions using the Jaccard similarity function, treating each transaction as the set of pages it contains. The Jaccard similarity penalizes a small number of shared clicks. Let $t$ and $s$ be two clickstream transactions. The similarity between the two transactions is computed as

$$\mathrm{sim}(t, s) = \frac{|t \cap s|}{|t \cup s|}$$

Here, $\mathrm{sim}(t,s) \in [0,1]$: $\mathrm{sim}(t,s)$ equals 1 when the two transactions $t$ and $s$ are exactly the same, and 0 when they are completely dissimilar. The similarity measure gives an idea of the interest and motivation that the users' access patterns have in common.
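As a small illustration of the measure, the sketch below computes the Jaccard similarity of two sessions. The function name is hypothetical, the sessions are the first and fourth page sets listed in the e-commerce example later in this section, and returning 0.0 for two empty sessions is an assumption that the paper does not address.

```python
from typing import Iterable


def jaccard(t: Iterable[str], s: Iterable[str]) -> float:
    """Jaccard similarity |t ∩ s| / |t ∪ s| of two clickstream transactions."""
    t_set, s_set = set(t), set(s)
    union = t_set | s_set
    if not union:
        return 0.0  # assumption: two empty sessions are treated as dissimilar
    return len(t_set & s_set) / len(union)


short_session = ["Home", "Login", "Help", "Logout"]
long_session = ["Home", "Login", "Help", "Fdback", "Shelf", "Promo", "Download", "Logout"]

value = jaccard(short_session, long_session)  # 4 shared pages out of 8 distinct pages
```

Because the sessions are converted to sets, the order of clicks and repeated visits to the same page do not affect the value.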

For a given threshold value $th \in (0,1)$ and for any two user transactions $t, s \in T$, where $T$ is the set of all clickstream transactions, a binary relation $R$ on $T$, denoted $tRs$, is defined by $tRs$ iff $\mathrm{sim}(t, s) \geq th$. The similarity class of $t$, denoted by $R(t)$, is the set of transactions that are similar to $t$ and is given by $R(t) = \{ s \in T : sRt \}$. For a fixed threshold $th \in (0, 1)$, $R$ is a binary tolerance relation on $T$: it is reflexive and symmetric, but not necessarily transitive.
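The similarity classes can be sketched as follows. The session names `a`, `b`, `c` and their pages are made up for illustration, and `similarity_classes` is a hypothetical helper, not a function from the paper.

```python
from typing import Dict, Set


def jaccard(t: Set[str], s: Set[str]) -> float:
    union = t | s
    return len(t & s) / len(union) if union else 0.0


def similarity_classes(transactions: Dict[str, Set[str]], th: float) -> Dict[str, Set[str]]:
    """Map each transaction name t to R(t) = {s : sim(t, s) >= th}."""
    return {
        name: {other for other, pages in transactions.items() if jaccard(t, pages) >= th}
        for name, t in transactions.items()
    }


# Hypothetical sessions: b overlaps with both a and c, but a and c share nothing.
sessions = {
    "a": {"p1", "p2", "p3", "p4"},
    "b": {"p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8"},
    "c": {"p5", "p6", "p7", "p8"},
}

classes = similarity_classes(sessions, th=0.4)
# a is similar to b and b is similar to c, yet a is not similar to c.
```

In this example the relation is not transitive, so the similarity classes overlap instead of partitioning the sessions. This is the situation the similarity upper approximation is meant to resolve.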

For clustering clickstream transactions, we compute a similarity upper approximation as follows:

Let $t_i \in T$ be a user clickstream. The upper approximation $\overline{R}(t_i)$ is the set of transactions similar to $t_i$; that is, a user who is visiting the hyperlinks in $t_i$ may also visit the hyperlinks present in other transactions in $\overline{R}(t_i)$. Similarly, $\overline{R}\,\overline{R}(t_i)$ is the set of transactions that are possibly similar to $\overline{R}(t_i)$, and this process continues until two consecutive upper approximations for $t_i$ are the same. The process of finding two equal consecutive upper approximations is known as the Similarity Upper Approximation and is denoted by $S_i$.
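One way to read this definition is as a fixed-point iteration over the similarity classes. The sketch below assumes the similarity classes are already available as a dictionary; the names `a`, `b`, `c` are hypothetical, and taking the upper approximation of a set to be every transaction whose similarity class intersects it is an interpretation of the text, not a quotation from it.

```python
from typing import Dict, Set


def upper_approximation(classes: Dict[str, Set[str]], subset: Set[str]) -> Set[str]:
    """All transactions whose similarity class intersects the given subset."""
    return {t for t, cls in classes.items() if cls & subset}


def similarity_upper_approximation(classes: Dict[str, Set[str]], t: str) -> Set[str]:
    """Apply the upper approximation repeatedly until it stops changing."""
    current = upper_approximation(classes, {t})
    while True:
        expanded = upper_approximation(classes, current)
        if expanded == current:      # two consecutive approximations are equal
            return current
        current = expanded


# Hypothetical similarity classes of a reflexive, symmetric relation.
classes = {
    "a": {"a", "b"},
    "b": {"a", "b", "c"},
    "c": {"b", "c"},
}

first = upper_approximation(classes, {"a"})          # one step from a
s_a = similarity_upper_approximation(classes, "a")   # fixed point for a
```

The loop terminates because the relation is reflexive, so each step can only add transactions, and the set of transactions is finite.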

Initially, each clickstream transaction is considered an individual cluster. The similarity upper approximation is calculated for each clickstream transaction in the given data set. In each iteration of agglomerative clustering, clusters are merged based on the similarity upper approximation. The computation of the similarity upper approximation is repeated for each transaction until two consecutive upper approximations are the same.

Let $S_1, S_2, \ldots, S_n$ be the similarity upper approximations for transactions $t_1, t_2, \ldots, t_n$ respectively. Now, if $S_i = S_j$ ($i$ and $j$ distinct), allocate $t_i$ and $t_j$ to the same cluster. Proceeding this way, we get a distribution of $m$ disjoint clusters. Let these $m$ clusters be $C_j$ ($j = 1, 2, \ldots, m$). Here, the $C_j$ are all distinct and $\bigcup_j C_j = T$. These $C_j$ represent the sub-groups of the transactions, that is, the transaction clusters.
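The grouping rule and the partition property can be checked with a short sketch. The approximations below are invented for illustration, and `group_by_approximation` and `is_partition` are hypothetical helper names.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set


def group_by_approximation(approximations: Dict[str, Set[str]]) -> List[Set[str]]:
    """Put transactions with identical similarity upper approximations together."""
    groups: Dict[FrozenSet[str], Set[str]] = defaultdict(set)
    for name, approx in approximations.items():
        groups[frozenset(approx)].add(name)
    return list(groups.values())


def is_partition(clusters: List[Set[str]], universe: Set[str]) -> bool:
    """True when the clusters are pairwise disjoint and their union is the universe."""
    union = set().union(*clusters) if clusters else set()
    return union == universe and sum(len(c) for c in clusters) == len(universe)


# Hypothetical similarity upper approximations S_i for three transactions.
approximations = {
    "t1": {"t1", "t2"},
    "t2": {"t1", "t2"},
    "t3": {"t3"},
}

clusters = group_by_approximation(approximations)
covers_all = is_partition(clusters, set(approximations))
```

Each transaction has exactly one approximation and therefore lands in exactly one group, which is why the groups are disjoint and cover $T$.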


The algorithm for clustering clickstream

transactions is given below:

Algorithm: Rough Agglomerative Clustering

Input: A set of $n$ objects in a data set $U = \{x_1, x_2, \ldots, x_n\}$, threshold $\theta$, the number of clusters $p$ ($\leq n$)

Output: Cluster scheme $C$

Step 1: Start

Step 2: Initially consider each object of $U$ as a cluster of one member, $C_i = \{x_i\}$, and $C = \{C_1, \ldots, C_n\}$

Step 3: For each pair of clusters $C_i$ and $C_j$ calculate $\mathrm{sim}(C_i, C_j) = |C_i \cap C_j| / |C_i \cup C_j|$

Step 4: For each cluster $C_i$ find the similarity upper approximation $S_i$ for the given threshold $\theta$.

Step 5: If $S_i = S_j$, form a new cluster $C_{ij} = C_i \cup C_j$, i.e. put $x_i$ and $x_j$ in the same cluster.

Step 6: Update $C$

Step 7: Repeat Steps 5 and 6 till there is no change in the number of clusters.

Step 8: Output $C$

Step 9: Stop
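The steps above can be sketched end to end in Python. This is an illustrative reading of the algorithm under stated assumptions, not the authors' Java implementation: the function names and the four sessions are hypothetical, the unused input $p$ is omitted, and Steps 5 to 7 are collapsed into a single grouping pass because transactions with equal approximations can be collected in one sweep.

```python
from typing import Dict, FrozenSet, List, Set


def jaccard(t: Set[str], s: Set[str]) -> float:
    union = t | s
    return len(t & s) / len(union) if union else 0.0


def rough_agglomerative_clustering(transactions: Dict[str, Set[str]], theta: float) -> List[List[str]]:
    names = list(transactions)

    # Steps 2-3: start from singletons and compute pairwise similarities,
    # kept here as the similarity class R(a) of every transaction a.
    classes = {
        a: {b for b in names if jaccard(transactions[a], transactions[b]) >= theta}
        for a in names
    }

    # Step 4: expand each class until two consecutive approximations agree.
    approximations: Dict[str, FrozenSet[str]] = {}
    for a in names:
        current = set(classes[a])
        while True:
            expanded = set().union(*(classes[b] for b in current))
            if expanded == current:
                break
            current = expanded
        approximations[a] = frozenset(current)

    # Steps 5-7: transactions with identical approximations share a cluster.
    clusters: Dict[FrozenSet[str], List[str]] = {}
    for a in names:
        clusters.setdefault(approximations[a], []).append(a)
    return list(clusters.values())


# Hypothetical sessions for illustration only.
sessions = {
    "u1": {"home", "login", "help"},
    "u2": {"home", "login", "help", "logout"},
    "u3": {"catalog", "product"},
    "u4": {"catalog", "product", "cart"},
}

result = rough_agglomerative_clustering(sessions, theta=0.5)
```

The expansion step takes the union of the similarity classes of the current members. For a symmetric relation this yields the same set as collecting every transaction whose similarity class intersects the current set.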

Let $N$ be the total number of clickstream transactions and $L$ be the average length of a transaction. The complexity of similarity computation is of the order $O(N \log L)$. Let $R$ be the relation defined over $T$; then the complexity of the upper approximation is of the order $O(T/R)$ (Jamil and Jitender, 2001), which is the same as $O(N/R)$. Merging of clusters takes place at each iteration based on the similarity upper approximation. Let $k$ be the average number of clusters merging in each iteration. The complexity of merging $k$ clusters is of the order $O(k \log k)$ (Dash et al., 2003) and there may be a maximum of $N/k$ iterations. Thus, the complexity of the merging process is $O((N/k)\, k \log k) = O(N \log k)$. So, the complexity of rough agglomerative clustering is of the order $O(N \log L) + O(N/R) + O(N \log k)$.

To explain the approach, consider the navigation patterns of users visiting an e-commerce site, shown in the transaction set $T$.

$T = \{t_1, t_2, \ldots, t_{10}\}$

$t_1$ = {Home, Login, Help, Logout}

$t_2$ = {Register, Regport, Results, Regform1, Regform2}

$t_3$ = {Catalog, Product, P_Info, AddCart}

$t_4$ = {Home, Login, Help, Fdback, Shelf, Promo, Download, Logout}

$t_5$ = {Register, Regform1, Results}

$t_6$ = {Regport, Regform2}

$t_7$ = {Fdback, Shelf, Promo, Download}

$t_8$ = {Charge, Pay_req, Pay_rem, Freeze}

$t_9$ = {Catalog, Product, P_Info, Cart, AddCart}

$t_{10}$ = {Charge, Pay_req, Pay_rem}

       t1    t2    t3    t4    t5    t6    t7    t8    t9    t10

t1     1     0     0     0.5   0     0     0     0     0     0

t2     0     1     0     0     0.6   0.4   0     0     0     0

t3     0     0     1     0     0     0     0     0     0.8   0

t4     0.5   0     0     1     0     0     0.5   0     0     0

t5     0     0.6   0     0     1     0     0     0     0     0

t6     0     0.4   0     0     0     1     0     0     0     0

t7     0     0     0     0.5   0     0     1     0     0     0

t8     0     0     0     0     0     0     0     1     0     0.6

t9     0     0     0.8   0     0     0     0     0     1     0

t10    0     0     0     0     0     0     0     0.6   0     1

Table 1: Similarity Matrix
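The matrix can be recomputed from the transaction set with a short script. This is a sketch under one assumption: the ten page sets are labelled `t1` to `t10` in the order in which they are listed. With that labelling, the plain Jaccard formula reproduces every entry of Table 1 except the pair (t8, t10), for which the listed sets give 3/4 = 0.75 rather than the tabulated 0.6. Both values lie above the threshold 0.4 used in the example, so the resulting clusters are unaffected.

```python
from typing import Dict, List, Set

# Transaction set T of the worked example, labelled in listing order (assumption).
T: Dict[str, Set[str]] = {
    "t1": {"Home", "Login", "Help", "Logout"},
    "t2": {"Register", "Regport", "Results", "Regform1", "Regform2"},
    "t3": {"Catalog", "Product", "P_Info", "AddCart"},
    "t4": {"Home", "Login", "Help", "Fdback", "Shelf", "Promo", "Download", "Logout"},
    "t5": {"Register", "Regform1", "Results"},
    "t6": {"Regport", "Regform2"},
    "t7": {"Fdback", "Shelf", "Promo", "Download"},
    "t8": {"Charge", "Pay_req", "Pay_rem", "Freeze"},
    "t9": {"Catalog", "Product", "P_Info", "Cart", "AddCart"},
    "t10": {"Charge", "Pay_req", "Pay_rem"},
}


def jaccard(t: Set[str], s: Set[str]) -> float:
    return len(t & s) / len(t | s)


names: List[str] = list(T)
matrix: List[List[float]] = [
    [round(jaccard(T[a], T[b]), 2) for b in names] for a in names
]

# Print the matrix row by row, one transaction per line.
for name, row in zip(names, matrix):
    print(f"{name:>3}", " ".join(f"{value:4.2f}" for value in row))
```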


The corresponding similarity matrix is shown in Table 1. At threshold value 0.4, the similarity classes are

$R(t_1) = \{t_1, t_4\}$, $R(t_2) = \{t_2, t_5, t_6\}$,

$R(t_3) = \{t_3, t_9\}$, $R(t_4) = \{t_1, t_4, t_7\}$,

$R(t_5) = \{t_2, t_5\}$, $R(t_6) = \{t_2, t_6\}$,

$R(t_7) = \{t_4, t_7\}$, $R(t_8) = \{t_8, t_{10}\}$,

$R(t_9) = \{t_3, t_9\}$, $R(t_{10}) = \{t_8, t_{10}\}$.

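In the first step, we compute the upper approximation of all ten transactions.

$\overline{R}(t_1) = \{t_1, t_4\}$, $\overline{R}(t_2) = \{t_2, t_5, t_6\}$,

$\overline{R}(t_3) = \{t_3, t_9\}$, $\overline{R}(t_4) = \{t_1, t_4, t_7\}$,

$\overline{R}(t_5) = \{t_2, t_5\}$, $\overline{R}(t_6) = \{t_2, t_6\}$,

$\overline{R}(t_7) = \{t_4, t_7\}$, $\overline{R}(t_8) = \{t_8, t_{10}\}$,

$\overline{R}(t_9) = \{t_3, t_9\}$, $\overline{R}(t_{10}) = \{t_8, t_{10}\}$.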

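Computing the similarity upper approximation, we get

$\overline{R}\,\overline{R}(t_1) = \{t_1, t_4, t_7\}$, $\overline{R}\,\overline{R}(t_2) = \{t_2, t_5, t_6\}$,

$\overline{R}\,\overline{R}(t_3) = \{t_3, t_9\}$, $\overline{R}\,\overline{R}(t_4) = \{t_1, t_4, t_7\}$,

$\overline{R}\,\overline{R}(t_5) = \{t_2, t_5, t_6\}$, $\overline{R}\,\overline{R}(t_6) = \{t_2, t_5, t_6\}$,

$\overline{R}\,\overline{R}(t_7) = \{t_1, t_4, t_7\}$, $\overline{R}\,\overline{R}(t_8) = \{t_8, t_{10}\}$,

$\overline{R}\,\overline{R}(t_9) = \{t_3, t_9\}$, $\overline{R}\,\overline{R}(t_{10}) = \{t_8, t_{10}\}$.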

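$\overline{R}\,\overline{R}\,\overline{R}(t_1) = \{t_1, t_4, t_7\}$,

$\overline{R}\,\overline{R}\,\overline{R}(t_5) = \{t_2, t_5, t_6\}$,

$\overline{R}\,\overline{R}\,\overline{R}(t_6) = \{t_2, t_5, t_6\}$,

$\overline{R}\,\overline{R}\,\overline{R}(t_7) = \{t_1, t_4, t_7\}$.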

Now the process stops, as two consecutive upper approximations are the same for each transaction. Thus, the clusters formed are $\{t_1, t_4, t_7\}$, $\{t_2, t_5, t_6\}$, $\{t_3, t_9\}$ and $\{t_8, t_{10}\}$; that is, we have four clusters. Since two or more clusters can agglomerate at each stage, the algorithm converges faster than a scheme that merges a single pair of clusters per step. Below we describe the mean profile of each cluster.

Cluster 1: It consists of three user navigation patterns, $t_1$, $t_4$ and $t_7$. Although $t_1$ and $t_7$ have navigated different sets of pages, each of them is at least 40% similar to $t_4$.

Cluster 2: It consists of three user navigation patterns, $t_2$, $t_5$ and $t_6$. According to Table 1, $t_5$ and $t_6$ are each at least 40% similar to $t_2$, although they share no pages with each other.

Cluster 3: It consists of two user navigation patterns, $t_3$ and $t_9$. Both of them have navigated the product information pages, and their navigation patterns are at least 40% similar.

Cluster 4: It consists of two user navigation patterns, $t_8$ and $t_{10}$. Both of them have navigated the payment pages, and their navigation patterns are at least 40% similar.
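The four clusters of the worked example can be cross-checked from Table 1 alone. Repeating the upper approximation until it stabilises collects every transaction reachable through a chain of similar pairs, so the clusters should coincide with the connected components of the graph whose edges are the pairs with similarity at or above the threshold. The sketch below uses a small union-find structure for this. It is an alternative check, not the procedure described in the paper, and the function name is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Non-zero off-diagonal entries of Table 1, keyed by (i, j) with i < j.
table1: Dict[Tuple[int, int], float] = {
    (1, 4): 0.5, (2, 5): 0.6, (2, 6): 0.4,
    (3, 9): 0.8, (4, 7): 0.5, (8, 10): 0.6,
}


def connected_components(n: int, edges: Iterable[Tuple[int, int]]) -> List[List[int]]:
    """Group the nodes 1..n into connected components with union-find."""
    parent = list(range(n + 1))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for x in range(1, n + 1):
        groups.setdefault(find(x), []).append(x)
    return sorted(groups.values())


def clusters_at(th: float) -> List[List[int]]:
    edges = [pair for pair, sim in table1.items() if sim >= th]
    return connected_components(10, edges)


clusters = clusters_at(0.4)
```

At threshold 0.4 this returns the transaction indices grouped as {1, 4, 7}, {2, 5, 6}, {3, 9} and {8, 10}. Raising the threshold to 0.5 drops the pair (t2, t6), which leaves t6 as a singleton cluster.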

4 EXPERIMENTAL RESULTS

We implemented our approach in Java and performed experiments on a 2.4 GHz Pentium-IV machine with 256 MB of memory running Microsoft Windows XP 2002. We used the clickstream dataset T40I10D100K (http://www.cs.helsinki.fi/u/goethals/dmcourse/util.html), collected from a Hungarian on-line news portal. The dataset contains 100,000 clickstream transactions. This set can be generated using the generator from the IBM Almaden Quest research group (http://www.almaden.ibm.com/software/quest/Resources/index.shtml). The clickstream transaction dataset contains transactions as small as one click and as large as thirty clicks. The average weighted length of the transactions is 10.06 clicks. Intuitively, very small and very large clickstreams may not provide any useful information about the users' navigation behavior. Thus, a transaction with fewer than 5 clicks is considered a short transaction and a transaction with more than 15 clicks is considered a long transaction. In the preprocessing step, short and long transactions are removed from the dataset.
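The preprocessing rule amounts to a length filter. The sketch below is illustrative only: the function name, the default bounds as keyword arguments and the sample data are assumptions, and it does not reproduce the parsing of the actual dataset file.

```python
from typing import List, Sequence


def remove_short_and_long(
    transactions: Sequence[Sequence[str]],
    min_clicks: int = 5,
    max_clicks: int = 15,
) -> List[Sequence[str]]:
    """Keep transactions with at least min_clicks and at most max_clicks clicks."""
    return [t for t in transactions if min_clicks <= len(t) <= max_clicks]


# Hypothetical transactions of 1, 5, 10, 15 and 30 clicks.
sample = [[f"page{i}" for i in range(n)] for n in (1, 5, 10, 15, 30)]

kept = remove_short_and_long(sample)
kept_lengths = [len(t) for t in kept]
```

Transactions of exactly 5 or 15 clicks are kept, since the text defines short as fewer than 5 clicks and long as more than 15.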

Experiments were performed on the preprocessed dataset with 81,832 records. At threshold value 0.8 we obtained 1,131 clusters, which took around 15 hours 58 minutes and 17 seconds. We also took a random sample of 2,000 records and preprocessed it; at threshold value 0.29 we obtained 154 clusters. Similarly, we took a random sample of 50,000 records and preprocessed it; at threshold value 0.6 we obtained 1,520 clusters.


5 CONCLUSIONS

Clustering is the task of grouping similar objects into clusters. Hierarchical agglomerative clustering approaches iteratively agglomerate the closest (or most similar) pair of clusters. In this work, we presented a rough agglomerative clustering technique to cluster clickstream transactions based on the similarity upper approximation. We evaluated our approach on a clickstream dataset collected from a Hungarian on-line news portal. Each clickstream transaction is of variable length. The presented clustering technique is useful for discovering groups of users with similar interest and motivation for visiting a particular website. This study is also helpful for building an adaptive web server that depends on the users' behavior.

REFERENCES

Joshi, A. and Krishnapuram, R., Robust fuzzy clustering methods to support web mining, Proceedings of the Workshop on Data Mining and Knowledge Discovery, SIGMOD '98, Seattle, pp. 15/1 – 15/8, June 1998.

Bezdek, J. C., Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.

Hogo, M., Snorek, M. and Lingras, P., Temporal versus latest snapshot web usage mining using Kohonen SOM and modified Kohonen SOM based on the properties of rough sets theory, International Journal on Artificial Intelligence Tools, vol. 13, no. 3 (2004) 569-591.

Cooley, R., Web Usage Mining: Discovery and Applications of Interesting Patterns from Web Data. PhD thesis, Dept. of Computer Science, University of Minnesota, May 2000.

Spiliopoulou, M. and Faulstich, L. C., WUM: A tool for web utilization analysis. In Extended version of Proc. EDBT Workshop WebDB'98, pages 184–203. Springer Verlag, 1999.

Mannila, H. and Meek, C., Global partial orders from sequential data. In Proc. 6th Intl. Conf. on Knowledge Discovery and Data Mining (KDD2000), pages 161–168, Aug 2000.

Pitkow, J. and Pirolli, P., Mining longest repeating subsequences to predict world wide web surfing. In Proc. 2nd USENIX Symposium on Internet Technologies & Systems (USITS'99), Oct 1999.

Fu, Y., Sandhu, K. and Shih, M., A generalization-based approach to clustering of web usage sessions. In

Dash, M., Huan, L., Peter, S., KianLee, T., Fast Hierarchical Clustering and its Validation, Data and Knowledge Engineering, 44(1) (2003) 109-138.

De, S. K., Radha Krishna, P., Mining web data using clustering technique for web personalization, Int. Jour. of Computational Intelligence and Applications, 2(3) (2002) 255-265.

De, S. K., Radha Krishna, P., Clustering web transactions using rough approximation, Fuzzy Sets and Systems (2004) (In print).

Jamil, S., Jitender, S. D., Concept Approximations Based on Rough Sets and Similarity Measures, International Journal on Applied Mathematics and Computer Science, 2001, Vol. 11, No. 3, 655 – 674.

Pawlak, Z., Rough Sets, International Journal of Computer and Information Sciences, 11 (1982) 341-356.

Perkowitz, M., Etzioni, O., Towards adaptive web sites: Conceptual framework and case study, Artificial Intelligence, 118 (2000) 245-275.

Spiliopoulou, M., Data mining for the web, In Principles of Data Mining and Knowledge Discovery, Second European Symposium, PKDD'99, (1999) 588-589.

Manco, G., Ortale, R. and Sacca, D., Similarity-based clustering of Web transactions, Proceedings of the 2003 ACM Symposium on Applied Computing, pp. 1212 - 1216.
