ARCHCOLLECT FRONT-END

A Web usage data mining knowledge acquisition mechanism focused on static or

dynamic contenting applications

Joubert de Castro Lima, Ahmed Ali Abdalla Esmin , Juvêncio Geraldo de Moura , Bruno Ferreira

Computer Science College – Facic/Fuom – Av.Dr.Arnaldo Senna, 328. Água Vermelha, Formiga – Brazil.

Tiago Garcia de Senna Carneiro

Federal University of Ouro Preto – Campus do Morro do Cruzeiro, Ouro Preto – Brazil.

eywords: Interactions, Data Warehouses, OLAP

Abstract: Knowledge acquisition mechanism is essencial to every Web usage mining project and it can be

implemented on the user or on all servers configuration. This paper presents a low coupled acquisition

mechanism focused on users’ interactions, associated with semantic data, binded to almost all markup

languages and with monitored application layout independence. This mechanism acquires knowledge only

from the Web browser. It separates the requests: one for the monitored application and the other for the

server called ArchCollect, and has a parser that automatically inserts the knowledge acquisition mechanism

into the static/dynamic user’s page. Like other acquisition mechanisms, the ArchCollect front-end is

scalable since it can deal with massive network traffic, adopting scalable ArchCollect servers or scalable

internal components. It is efficient since it reduces drastically the preprocessing, sharing this hard activity

with all users, and since it makes no log files interpretation or completation. It is realible since it eliminates

browser and server caches problems. This project can collect layout, usage and performance data, providing

general application focus, like Srivastava et.al proposed.

1 INTRODUCTION

Web usage mining tries to make sense of data

generated by the Web surfers´ sessions or behaviors.

The ArchCollect software is a typical usage mining

tool used to monitor users' interactions in several

environments, such as the Web, Telephony or

Interactive Digital TV. This article focuses only on

the Web media.

In general, projects aimed at monitoring the Web

usage have five main activities (Etzioni, 1999): 1) to

collect user interaction from the servers or from the

users, 2) to transform this interaction into useful

information to the projects, through data cleaning

and user transaction identification, 3) to store the set

of information into tables or into data structures,

typically, graphs, 4) to extract knowledge from the

information stored using content-based filtering and

collaborative filtering, and 5) to present online the

extracted knowledge with paralell coordinates or

scaterplots, for instance.

The information extracted from the acquired

interactions is useful in several activities: supporting

decisions on sites design, searching for justification

of content or layout changes (Spiliopoulou, M.,

2000), optimizing systems using cache or load

balancing policies (Perkowitz, 2000), offering

business or advertising intelligent decision services

(Buchner, 1998) and developing personalization

systems, called recommendation systems (Sarwar,

2000).

One of the most important goals of a web data

mining project is to have a reliable, efficient,

scalable, flexible and low coupled knowledge

acquisition mechanism (Shahabi, 2001), focused on

web usage data, performace data and layout dada

(Srivastava, 2000).

Collecting information only from the user only

permits that specific interactions, like a toolbar click

or a right mouse button click, to be acquired.

258

de Castro Lima J., Ali Abdalla Esmin A., Geraldo de Moura J., Ferreira B. and Garcia de Senna Carneiro T. (2004).

ARCHCOLLECT FRONT-END - A Web usage data mining knowledge acquisition mechanism focused on static or dynamic contenting applications.

In Proceedings of the Sixth International Conference on Enterprise Information Systems, pages 258-262

DOI: 10.5220/0002638402580262

 SciTePress

Interactions that characterize the end of a user’s

session are captured at the moment they happen,

with no need of timeouts or empiric searches for

interactions that best characterize users’ exits.

This paper presents a mechanism that can achieve

all goals above, and emphasizes the importance of

collecting interactions/clicks directly, associating

these interactions to semantic useful information and

not only collecting Web pages to be preprocessed.

Two other characteristics of the ArchCollect

project are the independence from the application

layout and the compatibility to almost all markup

languages.

Since it is possible to collect the user interaction

directly, it is also possible to form a dimensional

cube of information directly. This cube corresponds

to the ArchCollect knowledge, and can serve as an

accurate input for some information retrieval (IR)

approach.

A JavaScript function set forms the acquisition

framework that is inserted by a parser into the source

code of the monitored application. A set of

predefined events triggers the collection of relevant

information to the scope of a Web usage mining

tool: semantic information of the users’ interaction,

the occurrence page, the monitored interaction

layout, the user (HTTP header), the date and hour of

the interaction occurrence, the users’ page

permanence time and the users’ session.

The second section of this article approaches the

related works. Section 3 discusses the ArchCollect

acquisition mechanism. Finally, the conclusions are

drawn and the suggestions for future work are

proposed.

2 RELATED WORK

The WebSifit project can be mentioned (Cooley,

2000) as one of the most important in literature. The

project WebSifit works with log files generated by

the servers, clustering lines to form transactions,

which abstracts the idea of clusters of significant

references to each user. To obtain the interactions, it

is necessary the following preprocessing activities:

cleaning, user session identification and transaction

identification. Besides the preprocessing activities,

the activities of data integration and path

completation are important.

The projects (Gomory, Wu, Zaiane, 1999) have a

commercial stamp, offering data acquisition

mechanisms from the user and from several

monitored application servers.

There are also some projects similar to the

ArchCollect, which extract their information only

from the user. Projects as (Ackerman, 1997) and

(Lieberman, 1995) propose the use of agents for data

acquisition from the user, which demands significant

changes on the browsers.

The work (Shahabi, 2001) includes Java applets

at the user side for data acquisition. It emerges as

one of the most important contributions for the

acquisition mechanisms. The data acquisition model

consists of two components: 1) a remote agent, that

it is transferred from the Web server to the user

client as soon as the user loads the application first

page; and 2) a central acquisition server that emits

an unique and global identifier for each loaded agent

in the users. This server also collects and stores data

sent by the agents: the interaction occurrence page

and the user’s visit time.

The tools Alladvantage and NetZero are

examples of commercial solutions that work

collecting data only from the users.

3 THE ARCHCOLLECT

ACQUISITION MECHANISM

The ArchCollect data acquisition mechanism was

developed to collect interactions from specific

regions of the user’s browser. The coordinates (x,y)

of each interaction trigged by the onunload, onclick,

onmousedown or onload events are collected and

used by the mechanism illustrated by function

below. It maps the interaction coordinates into

distinct regions a, b or c .

F(x, y): a, if (x>0 ∨ y>0) ∧ a:(f,

elem, Σatrib, type), ∀ atrib ∈ elem

F(x, y): b, if (x>0 ∨ y<0) ∧

b:(toolbar, type)

F(x, y): c, if (x<0 ∨ y<0) ∧

c:(exit)

Function F(x, y) shows the regions into which the

user’s browser is divided: region a is defined as the

application area, region b as the toolbar area and

region c as the application exit area.

In region a, the application forms are found, as

well as several elements that compose it: buttons,

images, links, tables, text boxes etc. In the

ArchCollect acquisition mechanism, this region is

modeled as the triplet (f, elem, type) formed by the

frame f, the element elem where the interaction has

occurred and by the interaction type type.

Any interaction that happens in the form or in the

frame f is acquired, as well as the element elem,

belonging to the frame/form, and the interaction

type.

ARCHCOLLECT FRONT-END: A WEB USAGE DATA MINING KNOWLEDGE ACQUISITION MECHANISM

FOCUSED ON STATIC OR DYNAMIC CONTENTING APPLICATIONS

259

Figure 1: The distinct regions of a Web browser

In region a, there is also the possibility of

interaction through the right mouse button click to

view the page source code or to open the next page

from a link, for instance. This kind of interaction is

acquired by the ArchCollect with no extra

connections to the server. The interactions are

temporarily stored into a text field named

ArchCollectHiddenField, and they are sent when

the form is submitted. This mechanism guarantees

that all the interactions will be captured and sent to

the ArchCollect software analysis layer, without

damages for the user and for the monitored

application.

In region b, there is the toolbar, where the buttons

reload, back and forward are. Web users broadly use

this region, but interactions that occur here are not

acquired when data comes from log files, because

the browser cache is used. The ArchCollect

acquisition mechanism temporarily stores the

interaction through cookies, without creating extra

connections to the server, and sends the interaction,

together with the form, in the next connection to the

Web server. Region b is modeled as a pair (toolbar,

type).

In region c, there are the application exit button,

the exit option from the browser menu or, simply,

the keys ALT + F4. This last interaction type brings

benefits to the establishment of the end of each

user’s session, something extremely difficult and

accomplished with the use of empiric parameters.

Before closing the application, the acquisition

mechanism establishes one more connection to the

ArchCollect server to send the form, which has the

information related to the user’s exit interaction

stored into the ArchCollectHiddenField field. The

event onunload captures the exit interaction.

3.1 Main Code

The acquisition mechanism is based on listeners

that follow the DOM reference. Once an interaction

occurs the source code is interpreted:

function addlistener(element) {

element.onunload =

listenertoolbarandexit;

element.onclick = listener;

element.onmousedown =

listenerrightclicks;

element.onload = listenerloadtime;

}

//add listener to the DOM inicial

node

addlistener(window);

addlistener(document);

//add listeners to all links

for(var l = 0; l <

document.links.length; l++)

addlistener(document.links[l]);

//add lsiteners to all frames

for(var fr =0; fr <

window.frames.length; fr++)

addlistener(window.frames[fr]);

//add listeners to all forms

for(var f = 0; f <

document.forms.length; f++) {

addlistener(document.forms[f]);

//add listeners to all page elements

for(var e = 0; e <

document.forms[f].elements.length; e++)

addlistener(document.forms[f].elements[

e]);

}

Specific events call specific functions that act

based on monitor regions. The event onLoad collects

the entrance time and all other events will collect the

exit time, resulting in the permance time in a certain

page. MouseDown events are checked and if they

come from a right button click they will be

collected. Before submitting a page, the event

onUnload calls a function that checks if the

interaction comes from a toolbar or is an exit

interaction. Based on the interaction type, specific

code is interpreted. The onLoad event collects all the

interactions that occur on forms or on framesets.

ICEIS 2004 - SOFTWARE AGENTS AND INTERNET COMPUTING

260

This source code can be applied to dynamic or

static page layout, since the acquisition mechanism

is focused on general events and on page elements

properties like name, id, title, etc. Every node will

possess a listener following the hierarchy of the

DOM model, beginning with the node WINDOW,

followed by the nodes DOCUMENT,

FRAME/FORM, ELEMENT, LINK, etc.

The acquistion mechanism is inserted by a

parser between the tags <head> </ head>, what

guarantees it will be loaded before any application

element. The duplication component is responsible

for guaranteeing some tags existence on DHTML,

XHTML and XML pages.

3.2 Interaction Pattern

Project ArchCollect establishes the interaction

granularity to the dimension of the unique

interactions of unique users. For each interaction,

besides the occurrence page, any accessed element is

acquired. It is also possible to collect the semantic

information related to each interaction, in other

words, the clicked element possesses semantic

information of the type: shorts (age == 0.5, purchase

capacity == 0.5, resident == 0.5, where the value 0.5

means young, medium and resident of beach areas,

respectively). In the work (Lee, 2001), the authors

J.Y.Lee et al. present an interface centered in

subjective requirements on products. In the work

(Ghani, 2002) the importance of semantic

requirements related to the interactions is shown,

which drastically increases the accuracy of an

information retrieval approach.

Besides the semantic information about each

interaction, the acquisition mechanism collects data

on the layout of the interaction, the permanence time

of the user in a page, the time of occurrence of an

interaction, the user (HTTP header), the request

origin (referrer) and the user session.

Some other ArchCollect components collect

the ArchCollect and the monitored application

servers performance time and the network

interaction delay and all the collected information

defines the ArchCollect interaction pattern. For each

interaction, performance metrics (Menascé, 1998) as

the response time, waiting time, service time,

number of requests in the service waiting queue,

arrival rate, throughput and utilization of each

application server and of each ArchCollect server,

are collected . This way, it is possible to relate the

computational cost of each interaction to the volume

of businesses accomplished, in only one n-

dimensional cube of information. Figure 2 shows

the ArchCollect project interaction pattern.

A Data Warehouse can be implemented and some

multidimensional cubes can be formed with the

ArchCollect interaction pattern. Statistics models

can be done [Lima, 2003]. Also some data mining

aproaches can be used to correlate data. Clustering

algorithms can be used for segmentation.

Association algorithms can associate

products/services. Sequencial patterns reveal users’

interactions patterns. Similar times sequence

aproaches can be used to find purchases interactions

flutuation over a period of time.

The interaction pattern is based on the (Kimball,

2002) and (Lee, 2001) case studies, but other

business necessities may occur. New necessities are

implemented as semantic data.

4 CONCLUSIONS AND FURTHER

WORKS

The ArchCollect software will be available to the

general public under GNU GPL license terms at the

URL http://www.archcollect.org.

Preliminary tests showed that the acquisition

mechanism can be easily inserted into an application

pattern ::= user “+” date “+” page “+” element [“+”product] “+”permanence time

“+” interaction layout“+” referrer“+” session”+” semantic”+”performance

user ::= persistentcookie“+” IP“+” agent“+” protocol“+” host“+” adress“+” accept

date ::= day“+” month“+” year“+” hour“+” minute“+” second

page ::= name“+”id

element ::= name“+”title

product ::= quantity“+”price

I layout::= x position “+” y position

session ::= userIP[[“+” sessioncookie]“+” persistentcookie] “+” session permanence time“+” entrance page“+” exit page“+” sessionID

semantic::= age association”+” rent association”+” location association”+” preferernce association“+” etc.

performance::= ArchCollectWaitTime “+” OthersWaitTime “+” ArchCollectServiceTime “+” OthersServiceTime “+”

ArchCollectCollectingCompTime “+” ArchCollectLoadingCompTime “+” networkDelay

Figure 2: ArchCollect Interaction Pattern

ARCHCOLLECT FRONT-END: A WEB USAGE DATA MINING KNOWLEDGE ACQUISITION MECHANISM

FOCUSED ON STATIC OR DYNAMIC CONTENTING APPLICATIONS

261

without demanding manual updating on the source

code. The ArchCollect software current version

presents a significant improvement to the interaction

pattern and to the monitored application markup

language implementation, when compared to the

previous version (Lima, 2003).

A direct interaction acquisition mechanism can

generate fast and accurate n-dimensional

information cube used as a collaborative filtering

algorithm input, to establish some patterns based on

usage, layout, content and performance focus. None

of the related works collect the element page

directly, what compromises the accuracy, making

future preprocessing necessary.

The acquisition mechanism compatibility to the

browsers, because these assume that regions b and c

form a single region. For this browser type only the

interactions in region b are captured.

A possible limitation, not only to the ArchCollect

acquisition mechanism, but to every Web usage

mining tool that collects data from the user, is the

risk of the user disabling the monitoring in his/her

browser. That is valid for tools that use cookies, Java

applets or any plugin to be installed.

The collecting mechanism can be extended to

collect sounds and facial images produced by users,

and also to collect defined interactions as XML

elements/metatags from distributed systems.

Finally, it is necessary to extend the acquisition

mechanism compatibility, which is so far limited to

REFERENCES

Etzioni, O., 1999. The world wide web: Quaqmire or gold

mine. Communications of the ACM, 39(11):65-68.

Shahabi, C., Banaei-Kashani, F., Faruque, J., 2001. A

reliable, eficient, and scalable system for web usage

data acquisition. WebKDD'01 Workshop, ACM-

SIGKDD 2001, São Francisco, CA.

Spiliopoulou, M., 2000. Web usage mining for site

evolution: Making a site better fit its users. Special

Section of the Communications of ACM on

“Personalization Technologies with Data Mining”,

43(8):127-134, August, 2000.

Perkowitz, M., and Etzioni, O., 2000. Toward adaptive

Web sites: conceptual framework and case study.

Artificial Intelligence 118, p.p245-275, 2000.

Buchner, A.G., and Mulvenna, M.D., 1998. Discovering

Internet Marketing Intelligence through Online

Analytical Web Usage Mining. ACM SIGMOD

Record, ISSN 0163-5808, Vol. 27, No.4, p.p 54-61,

1998.

Sarwar, B.M., Karypis, G., Kostan, J.A., and Riedl, R.,

2000. Analysis of Recommender Algorithms for E-

Commerce. ACM E-Commerce’00 Conference.

October, 2000.

Srivastava, J., Cooley, R., Deshpande, M., Tan, P., 2000.

Web usage minig : Discovery and applications of

usage patterns from web data, SIGKDD, January,

2000.

Cooley, R., Tan, P., Srivastava, J., 2000. Discovery of

Interesting Usage Patterns from Web Data. Advances

in Web Usage Analysis and User Profiling, Lecture

Notes in Computer Science, Vol. 1836, Springer-

Verlag, 2000.

Gomory, S., Hoch, R., Lee, J., Poldlaseck, M., Schonberg,

E., 1999. Ecommerce Intelligence : Measuring,

Analyzing, and Reporting on Merchandising

Effectiveness of Online Stores, IBM Watson Research

Center.

Wu, K., Yu, P.S., and Ballman, A., 1999. SpeedTracer: A

web usage mining and analysis tool. IBM Systems

Journal, 37(1), 1999.

Zaiane, O.R., Xin, M., Han, J., 1999. Discovering Web

Access Patterns and Trends by Applying OLAP and

Data Mining Technology on Web Logs, Proc. of

Advances in Digital Libraries Conference, 1999.

Ackerman M. D., et al. 1997. “Learning Probabilistic user

profiles: Applications to finding interesting Web sites,

notifying users of relevant changes to the Web pages,

and locating grant opportunities”. AI Magazine 18(2)

47-56, 1997.

Lieberman H., 1995. “ Letizia: An agent that assists Web

browsing”. Proceedings of the international joint

conference on Artificial Intelligence, Montreal,

August 1995.

Alladvantage - http://www.alladvantage.com

NetZero - http://www.netzero.com

Lee, J., Lee, H.S., Wang, P., 2001. Design and

Implementation of a Visual Online Product Catalog

Interface. ICEIS (2) 2001: 1010-1017.

Ghani, R., Fano, A., 2002. Towards Semantic Data

Mining: Creating and Using a Knowledge Base of

Product Semantics, KDD 2002, Edmonton, Canada.

Menascé, Daniel A. & Almeida, Virgilio A.F., "Capacity

Planning for WEB Performance - Metrics, Models &

Methods", Prentice Hall, PTR, 1998.

Lima J.C., Carneiro T.G.S., Pagliares R.M., et. al, 2003.

ArchCollect: A set of Components directed towards

web users' interaction, ICEIS 2003 Conference,

Angers - France.

Lima J.C., Carneiro T.G.S., Esmin, A.A.A., 2003.

ArchCollect: concepts of na architecture that offers

services for analyzing and understanding Web users’

interactions. Conexão Ciência Magazine– Year I, n°

01, pp. 39-46 - Formiga – Brazil – 2003.

Kimball, R., 2002. Data Webhouse Toolkit. Editora

Campus.

ICEIS 2004 - SOFTWARE AGENTS AND INTERNET COMPUTING

262