ARCHCOLLECT FRONT-END
A Web usage data mining knowledge acquisition mechanism focused on static or
dynamic contenting applications
Joubert de Castro Lima, Ahmed Ali Abdalla Esmin , Juvêncio Geraldo de Moura , Bruno Ferreira
Computer Science College – Facic/Fuom – Av.Dr.Arnaldo Senna, 328. Água Vermelha, Formiga – Brazil.
Tiago Garcia de Senna Carneiro
Federal University of Ouro Preto – Campus do Morro do Cruzeiro, Ouro Preto – Brazil.
K
eywords: Interactions, Data Warehouses, OLAP
Abstract: Knowledge acquisition mechanism is essencial to every Web usage mining project and it can be
implemented on the user or on all servers configuration. This paper presents a low coupled acquisition
mechanism focused on users’ interactions, associated with semantic data, binded to almost all markup
languages and with monitored application layout independence. This mechanism acquires knowledge only
from the Web browser. It separates the requests: one for the monitored application and the other for the
server called ArchCollect, and has a parser that automatically inserts the knowledge acquisition mechanism
into the static/dynamic user’s page. Like other acquisition mechanisms, the ArchCollect front-end is
scalable since it can deal with massive network traffic, adopting scalable ArchCollect servers or scalable
internal components. It is efficient since it reduces drastically the preprocessing, sharing this hard activity
with all users, and since it makes no log files interpretation or completation. It is realible since it eliminates
browser and server caches problems. This project can collect layout, usage and performance data, providing
general application focus, like Srivastava et.al proposed.
1 INTRODUCTION
Web usage mining tries to make sense of data
generated by the Web surfers´ sessions or behaviors.
The ArchCollect software is a typical usage mining
tool used to monitor users' interactions in several
environments, such as the Web, Telephony or
Interactive Digital TV. This article focuses only on
the Web media.
In general, projects aimed at monitoring the Web
usage have five main activities (Etzioni, 1999): 1) to
collect user interaction from the servers or from the
users, 2) to transform this interaction into useful
information to the projects, through data cleaning
and user transaction identification, 3) to store the set
of information into tables or into data structures,
typically, graphs, 4) to extract knowledge from the
information stored using content-based filtering and
collaborative filtering, and 5) to present online the
extracted knowledge with paralell coordinates or
scaterplots, for instance.
The information extracted from the acquired
interactions is useful in several activities: supporting
decisions on sites design, searching for justification
of content or layout changes (Spiliopoulou, M.,
2000), optimizing systems using cache or load
balancing policies (Perkowitz, 2000), offering
business or advertising intelligent decision services
(Buchner, 1998) and developing personalization
systems, called recommendation systems (Sarwar,
2000).
One of the most important goals of a web data
mining project is to have a reliable, efficient,
scalable, flexible and low coupled knowledge
acquisition mechanism (Shahabi, 2001), focused on
web usage data, performace data and layout dada
(Srivastava, 2000).
Collecting information only from the user only
permits that specific interactions, like a toolbar click
or a right mouse button click, to be acquired.
258
de Castro Lima J., Ali Abdalla Esmin A., Geraldo de Moura J., Ferreira B. and Garcia de Senna Carneiro T. (2004).
ARCHCOLLECT FRONT-END - A Web usage data mining knowledge acquisition mechanism focused on static or dynamic contenting applications.
In Proceedings of the Sixth International Conference on Enterprise Information Systems, pages 258-262
DOI: 10.5220/0002638402580262
Copyright
c
SciTePress
Interactions that characterize the end of a user’s
session are captured at the moment they happen,
with no need of timeouts or empiric searches for
interactions that best characterize users’ exits.
This paper presents a mechanism that can achieve
all goals above, and emphasizes the importance of
collecting interactions/clicks directly, associating
these interactions to semantic useful information and
not only collecting Web pages to be preprocessed.
Two other characteristics of the ArchCollect
project are the independence from the application
layout and the compatibility to almost all markup
languages.
Since it is possible to collect the user interaction
directly, it is also possible to form a dimensional
cube of information directly. This cube corresponds
to the ArchCollect knowledge, and can serve as an
accurate input for some information retrieval (IR)
approach.
A JavaScript function set forms the acquisition
framework that is inserted by a parser into the source
code of the monitored application. A set of
predefined events triggers the collection of relevant
information to the scope of a Web usage mining
tool: semantic information of the users’ interaction,
the occurrence page, the monitored interaction
layout, the user (HTTP header), the date and hour of
the interaction occurrence, the users’ page
permanence time and the users’ session.
The second section of this article approaches the
related works. Section 3 discusses the ArchCollect
acquisition mechanism. Finally, the conclusions are
drawn and the suggestions for future work are
proposed.
2 RELATED WORK
The WebSifit project can be mentioned (Cooley,
2000) as one of the most important in literature. The
project WebSifit works with log files generated by
the servers, clustering lines to form transactions,
which abstracts the idea of clusters of significant
references to each user. To obtain the interactions, it
is necessary the following preprocessing activities:
cleaning, user session identification and transaction
identification. Besides the preprocessing activities,
the activities of data integration and path
completation are important.
The projects (Gomory, Wu, Zaiane, 1999) have a
commercial stamp, offering data acquisition
mechanisms from the user and from several
monitored application servers.
There are also some projects similar to the
ArchCollect, which extract their information only
from the user. Projects as (Ackerman, 1997) and
(Lieberman, 1995) propose the use of agents for data
acquisition from the user, which demands significant
changes on the browsers.
The work (Shahabi, 2001) includes Java applets
at the user side for data acquisition. It emerges as
one of the most important contributions for the
acquisition mechanisms. The data acquisition model
consists of two components: 1) a remote agent, that
it is transferred from the Web server to the user
client as soon as the user loads the application first
page; and 2) a central acquisition server that emits
an unique and global identifier for each loaded agent
in the users. This server also collects and stores data
sent by the agents: the interaction occurrence page
and the user’s visit time.
The tools Alladvantage and NetZero are
examples of commercial solutions that work
collecting data only from the users.
3 THE ARCHCOLLECT
ACQUISITION MECHANISM
The ArchCollect data acquisition mechanism was
developed to collect interactions from specific
regions of the user’s browser. The coordinates (x,y)
of each interaction trigged by the onunload, onclick,
onmousedown or onload events are collected and
used by the mechanism illustrated by function
below. It maps the interaction coordinates into
distinct regions a, b or c .
F(x, y): a, if (x>0 y>0) a:(f,
elem, Σatrib, type), atrib elem
F(x, y): b, if (x>0 y<0)
b:(toolbar, type)
F(x, y): c, if (x<0 y<0)
c:(exit)
Function F(x, y) shows the regions into which the
user’s browser is divided: region a is defined as the
application area, region b as the toolbar area and
region c as the application exit area.
In region a, the application forms are found, as
well as several elements that compose it: buttons,
images, links, tables, text boxes etc. In the
ArchCollect acquisition mechanism, this region is
modeled as the triplet (f, elem, type) formed by the
frame f, the element elem where the interaction has
occurred and by the interaction type type.
Any interaction that happens in the form or in the
frame f is acquired, as well as the element elem,
belonging to the frame/form, and the interaction
type.
ARCHCOLLECT FRONT-END: A WEB USAGE DATA MINING KNOWLEDGE ACQUISITION MECHANISM
FOCUSED ON STATIC OR DYNAMIC CONTENTING APPLICATIONS
259
c
b
a
Figure 1: The distinct regions of a Web browser
In region a, there is also the possibility of
interaction through the right mouse button click to
view the page source code or to open the next page
from a link, for instance. This kind of interaction is
acquired by the ArchCollect with no extra
connections to the server. The interactions are
temporarily stored into a text field named
ArchCollectHiddenField, and they are sent when
the form is submitted. This mechanism guarantees
that all the interactions will be captured and sent to
the ArchCollect software analysis layer, without
damages for the user and for the monitored
application.
In region b, there is the toolbar, where the buttons
reload, back and forward are. Web users broadly use
this region, but interactions that occur here are not
acquired when data comes from log files, because
the browser cache is used. The ArchCollect
acquisition mechanism temporarily stores the
interaction through cookies, without creating extra
connections to the server, and sends the interaction,
together with the form, in the next connection to the
Web server. Region b is modeled as a pair (toolbar,
type).
In region c, there are the application exit button,
the exit option from the browser menu or, simply,
the keys ALT + F4. This last interaction type brings
benefits to the establishment of the end of each
user’s session, something extremely difficult and
accomplished with the use of empiric parameters.
Before closing the application, the acquisition
mechanism establishes one more connection to the
ArchCollect server to send the form, which has the
information related to the user’s exit interaction
stored into the ArchCollectHiddenField field. The
event onunload captures the exit interaction.
3.1 Main Code
The acquisition mechanism is based on listeners
that follow the DOM reference. Once an interaction
occurs the source code is interpreted:
function addlistener(element) {
element.onunload =
listenertoolbarandexit;
element.onclick = listener;
element.onmousedown =
listenerrightclicks;
element.onload = listenerloadtime;
}
//add listener to the DOM inicial
node
addlistener(window);
addlistener(document);
//add listeners to all links
for(var l = 0; l <
document.links.length; l++)
addlistener(document.links[l]);
//add lsiteners to all frames
for(var fr =0; fr <
window.frames.length; fr++)
addlistener(window.frames[fr]);
//add listeners to all forms
for(var f = 0; f <
document.forms.length; f++) {
addlistener(document.forms[f]);
//add listeners to all page elements
for(var e = 0; e <
document.forms[f].elements.length; e++)
addlistener(document.forms[f].elements[
e]);
}
Specific events call specific functions that act
based on monitor regions. The event onLoad collects
the entrance time and all other events will collect the
exit time, resulting in the permance time in a certain
page. MouseDown events are checked and if they
come from a right button click they will be
collected. Before submitting a page, the event
onUnload calls a function that checks if the
interaction comes from a toolbar or is an exit
interaction. Based on the interaction type, specific
code is interpreted. The onLoad event collects all the
interactions that occur on forms or on framesets.
ICEIS 2004 - SOFTWARE AGENTS AND INTERNET COMPUTING
260
This source code can be applied to dynamic or
static page layout, since the acquisition mechanism
is focused on general events and on page elements
properties like name, id, title, etc. Every node will
possess a listener following the hierarchy of the
DOM model, beginning with the node WINDOW,
followed by the nodes DOCUMENT,
FRAME/FORM, ELEMENT, LINK, etc.
The acquistion mechanism is inserted by a
parser between the tags <head> </ head>, what
guarantees it will be loaded before any application
element. The duplication component is responsible
for guaranteeing some tags existence on DHTML,
XHTML and XML pages.
3.2 Interaction Pattern
Project ArchCollect establishes the interaction
granularity to the dimension of the unique
interactions of unique users. For each interaction,
besides the occurrence page, any accessed element is
acquired. It is also possible to collect the semantic
information related to each interaction, in other
words, the clicked element possesses semantic
information of the type: shorts (age == 0.5, purchase
capacity == 0.5, resident == 0.5, where the value 0.5
means young, medium and resident of beach areas,
respectively). In the work (Lee, 2001), the authors
J.Y.Lee et al. present an interface centered in
subjective requirements on products. In the work
(Ghani, 2002) the importance of semantic
requirements related to the interactions is shown,
which drastically increases the accuracy of an
information retrieval approach.
Besides the semantic information about each
interaction, the acquisition mechanism collects data
on the layout of the interaction, the permanence time
of the user in a page, the time of occurrence of an
interaction, the user (HTTP header), the request
origin (referrer) and the user session.
Some other ArchCollect components collect
the ArchCollect and the monitored application
servers performance time and the network
interaction delay and all the collected information
defines the ArchCollect interaction pattern. For each
interaction, performance metrics (Menascé, 1998) as
the response time, waiting time, service time,
number of requests in the service waiting queue,
arrival rate, throughput and utilization of each
application server and of each ArchCollect server,
are collected . This way, it is possible to relate the
computational cost of each interaction to the volume
of businesses accomplished, in only one n-
dimensional cube of information. Figure 2 shows
the ArchCollect project interaction pattern.
A Data Warehouse can be implemented and some
multidimensional cubes can be formed with the
ArchCollect interaction pattern. Statistics models
can be done [Lima, 2003]. Also some data mining
aproaches can be used to correlate data. Clustering
algorithms can be used for segmentation.
Association algorithms can associate
products/services. Sequencial patterns reveal users’
interactions patterns. Similar times sequence
aproaches can be used to find purchases interactions
flutuation over a period of time.
The interaction pattern is based on the (Kimball,
2002) and (Lee, 2001) case studies, but other
business necessities may occur. New necessities are
implemented as semantic data.
4 CONCLUSIONS AND FURTHER
WORKS
The ArchCollect software will be available to the
general public under GNU GPL license terms at the
URL http://www.archcollect.org.
Preliminary tests showed that the acquisition
mechanism can be easily inserted into an application
pattern ::= user “+” date “+” page “+” element [“+”product] “+”permanence time
“+” interaction layout“+” referrer“+” session”+” semantic”+”performance
user ::= persistentcookie“+” IP“+” agent“+” protocol“+” host“+” adress“+” accept
date ::= day“+” month“+” year“+” hour“+” minute“+” second
page ::= name“+”id
element ::= name“+”title
product ::= quantity“+”price
I layout::= x position “+” y position
session ::= userIP[[“+” sessioncookie]“+” persistentcookie] “+” session permanence time“+” entrance page“+” exit page“+” sessionID
semantic::= age association”+” rent association”+” location association”+” preferernce association“+” etc.
performance::= ArchCollectWaitTime “+” OthersWaitTime “+” ArchCollectServiceTime “+” OthersServiceTime “+”
ArchCollectCollectingCompTime “+” ArchCollectLoadingCompTime “+” networkDelay
Figure 2: ArchCollect Interaction Pattern
ARCHCOLLECT FRONT-END: A WEB USAGE DATA MINING KNOWLEDGE ACQUISITION MECHANISM
FOCUSED ON STATIC OR DYNAMIC CONTENTING APPLICATIONS
261
without demanding manual updating on the source
code. The ArchCollect software current version
presents a significant improvement to the interaction
pattern and to the monitored application markup
language implementation, when compared to the
previous version (Lima, 2003).
A direct interaction acquisition mechanism can
generate fast and accurate n-dimensional
information cube used as a collaborative filtering
algorithm input, to establish some patterns based on
usage, layout, content and performance focus. None
of the related works collect the element page
directly, what compromises the accuracy, making
future preprocessing necessary.
The acquisition mechanism compatibility to the
Netscape© browser is not complete. It cannot
capture the end of the session in Netscape©
browsers, because these assume that regions b and c
form a single region. For this browser type only the
interactions in region b are captured.
A possible limitation, not only to the ArchCollect
acquisition mechanism, but to every Web usage
mining tool that collects data from the user, is the
risk of the user disabling the monitoring in his/her
browser. That is valid for tools that use cookies, Java
applets or any plugin to be installed.
The collecting mechanism can be extended to
collect sounds and facial images produced by users,
and also to collect defined interactions as XML
elements/metatags from distributed systems.
Finally, it is necessary to extend the acquisition
mechanism compatibility, which is so far limited to
the Internet Explorer© and Netscape© browsers.
REFERENCES
Etzioni, O., 1999. The world wide web: Quaqmire or gold
mine. Communications of the ACM, 39(11):65-68.
Shahabi, C., Banaei-Kashani, F., Faruque, J., 2001. A
reliable, eficient, and scalable system for web usage
data acquisition. WebKDD'01 Workshop, ACM-
SIGKDD 2001, São Francisco, CA.
Spiliopoulou, M., 2000. Web usage mining for site
evolution: Making a site better fit its users. Special
Section of the Communications of ACM on
“Personalization Technologies with Data Mining”,
43(8):127-134, August, 2000.
Perkowitz, M., and Etzioni, O., 2000. Toward adaptive
Web sites: conceptual framework and case study.
Artificial Intelligence 118, p.p245-275, 2000.
Buchner, A.G., and Mulvenna, M.D., 1998. Discovering
Internet Marketing Intelligence through Online
Analytical Web Usage Mining. ACM SIGMOD
Record, ISSN 0163-5808, Vol. 27, No.4, p.p 54-61,
1998.
Sarwar, B.M., Karypis, G., Kostan, J.A., and Riedl, R.,
2000. Analysis of Recommender Algorithms for E-
Commerce. ACM E-Commerce’00 Conference.
October, 2000.
Srivastava, J., Cooley, R., Deshpande, M., Tan, P., 2000.
Web usage minig : Discovery and applications of
usage patterns from web data, SIGKDD, January,
2000.
Cooley, R., Tan, P., Srivastava, J., 2000. Discovery of
Interesting Usage Patterns from Web Data. Advances
in Web Usage Analysis and User Profiling, Lecture
Notes in Computer Science, Vol. 1836, Springer-
Verlag, 2000.
Gomory, S., Hoch, R., Lee, J., Poldlaseck, M., Schonberg,
E., 1999. Ecommerce Intelligence : Measuring,
Analyzing, and Reporting on Merchandising
Effectiveness of Online Stores, IBM Watson Research
Center.
Wu, K., Yu, P.S., and Ballman, A., 1999. SpeedTracer: A
web usage mining and analysis tool. IBM Systems
Journal, 37(1), 1999.
Zaiane, O.R., Xin, M., Han, J., 1999. Discovering Web
Access Patterns and Trends by Applying OLAP and
Data Mining Technology on Web Logs, Proc. of
Advances in Digital Libraries Conference, 1999.
Ackerman M. D., et al. 1997. “Learning Probabilistic user
profiles: Applications to finding interesting Web sites,
notifying users of relevant changes to the Web pages,
and locating grant opportunities”. AI Magazine 18(2)
47-56, 1997.
Lieberman H., 1995. “ Letizia: An agent that assists Web
browsing”. Proceedings of the international joint
conference on Artificial Intelligence, Montreal,
August 1995.
Alladvantage - http://www.alladvantage.com
NetZero - http://www.netzero.com
Lee, J., Lee, H.S., Wang, P., 2001. Design and
Implementation of a Visual Online Product Catalog
Interface. ICEIS (2) 2001: 1010-1017.
Ghani, R., Fano, A., 2002. Towards Semantic Data
Mining: Creating and Using a Knowledge Base of
Product Semantics, KDD 2002, Edmonton, Canada.
Menascé, Daniel A. & Almeida, Virgilio A.F., "Capacity
Planning for WEB Performance - Metrics, Models &
Methods", Prentice Hall, PTR, 1998.
Lima J.C., Carneiro T.G.S., Pagliares R.M., et. al, 2003.
ArchCollect: A set of Components directed towards
web users' interaction, ICEIS 2003 Conference,
Angers - France.
Lima J.C., Carneiro T.G.S., Esmin, A.A.A., 2003.
ArchCollect: concepts of na architecture that offers
services for analyzing and understanding Web users’
interactions. Conexão Ciência Magazine– Year I, n°
01, pp. 39-46 - Formiga – Brazil – 2003.
Kimball, R., 2002. Data Webhouse Toolkit. Editora
Campus.
ICEIS 2004 - SOFTWARE AGENTS AND INTERNET COMPUTING
262