A NEW TECHNIQUE FOR IDENTIFICATION OF RELEVANT WEB
PAGES IN INFORMATIONAL QUERIES RESULTS
Fabio Clarizia, Luca Greco and Paolo Napoletano
Department of Information Engineering and Electrical Engineering, University of Salerno
Via Ponte Don Melillo 1, 84084 Fisciano, Italy
Keywords:
Web search engine, Ontology, Topic model.
Abstract:
In this paper we present a new technique for retrieving relevant web pages in the results of informational queries.
The proposed technique, based on a probabilistic model of language, is embedded in a traditional web search
engine. The relevance of a Web page has been obtained through the judgement of human beings who, referring
to a continuous scale, assigned a degree of importance to each of the analyzed websites. In order to validate
the proposed method, a comparison with a classic engine is presented, based on precision and recall measures
and on a measure of distance with respect to the relevance scores obtained from humans.
1 INTRODUCTION
Modern Web search engines rely on keyword matching and link structure (cf. Google and its PageRank algorithm (Brin, 1998)), but the semantic gap has still not been bridged. Indeed, the semantics of a web page is defined by its content and context; understanding textual documents is still beyond the capability of today's artificial intelligence techniques, and the many multimedia features of a web page make the extraction and representation of its semantics even more difficult. The most critical aspect, however, is the intrinsic ambiguity of language, which makes the task far harder: when performing informational queries, existing web search engines often return results that are not close enough to user intentions. As is well known, any writing process can be thought of as a process of communication in which the main actor, the writer, encodes his intentions through language. Language can therefore be considered as a code that conveys what we may call "meaning" to the reader, who performs a decoding process. Unfortunately, due to the contingent imperfections of human languages, both the encoding and the decoding processes are corrupted by "noise" and are therefore ambiguous in practice. The Semantic Web (Berners-Lee et al., 2001) and Knowledge Engineering communities are both engaged in the endeavour of designing tools and languages for describing semantics so as to avoid the ambiguity of the encoding/decoding process. In the light of this discussion, specific languages have been introduced, such as RDF (Resource Description Framework) and OWL (Web Ontology Language), to support the creator (writer) of documents in describing semantic relations between concepts/words, namely by adding descriptions of the document's data: the metadata. During such a creation process all elements of ambiguity should be avoided thanks to the use of a shared knowledge representation: the ontology. Current web pages and resources are coded in HTML and/or XHTML, languages which, even though they permit minimal data descriptions, are not appropriate, from a technological point of view, for providing a purely semantic description. As a consequence the Web should be entirely rewritten in order to semantically arrange the content of each web page, but this process cannot be realized yet, due to the huge amount of existing data and the absence of definitive tools for managing and manipulating those new languages. In the meantime, while waiting for the Semantic Web to take off, we can design tools for automatically revealing and managing the semantics of the existing Web by using methods and tools that do not rely on any Semantic Web specification. In this direction, several efforts to provide models of language have been made by the Natural Language Processing community, principally within a probabilistic framework (Manning and Schütze, 1999). The Information Retrieval community has also focused its attention
on probabilistic models of language: methods such as probabilistic Latent Semantic Indexing (pLSI) (Hofmann, 1999) and, later, Latent Dirichlet Allocation (LDA) (Blei et al., 2003) are well-known probabilistic methods for the automatic categorization of documents.
As mentioned above, existing Web search engines primarily solve syntactic queries and, as a side effect, the majority of search results do not completely satisfy user intentions, especially when the queries are informational. This work shows how a classic search engine can improve its search results, bringing them closer to user intentions, by using a tool for the automatic creation and manipulation of ontologies based on an extension of LDA, quoted above, called the topic model. To this end, a new search engine, iSoS, based on the existing open source software Lucene, was developed. More details are given in the next sections, together with experiments aimed at comparing the behaviour of iSoS with that of a classic engine. The comparison method relies on an innovative procedure based on human judgement, which we discuss at length below and which represents the real core of this paper.
2 iSoS: A TRADITIONAL WEB SEARCH ENGINE WITH ADVANCED FUNCTIONALITIES
As discussed above, iSoS is a web search engine with advanced functionalities. It is a web-based, server-side application, entirely written in Java and Java Server Pages, which embeds a customized version of the open source Apache Lucene API (http://lucene.apache.org/) for indexing and searching. In the next sections we show the main properties and functionalities of the iSoS framework. Some use cases will be shown, including how to build a new index, include one or more ontologies, perform a query and build a new ontology.
2.1 Web Crawling
Every web search engine works by storing information about web pages retrieved by a Web crawler, a program which essentially follows every link it finds while browsing the Web. Due to hardware limitations, our application does not implement its own crawling system; instead, a smaller environment is created in order to evaluate performance: the crawling stage is performed by submitting a specific query to the well-known web search engine Google (www.google.com) and extracting the URLs from the retrieved results. The application then downloads the corresponding web pages, which are collected in specific folders and indexed. The GUI allows users to choose the query and the number of pages they want to index.
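To make the download step concrete, the following is a minimal sketch of how such a collection stage can be implemented in Java, the implementation language of iSoS; the class name, the example URL and the folder layout are purely illustrative and are not the actual iSoS code.

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.List;

public class PageDownloader {

    // Downloads every URL in the list into the given folder as page_0.html, page_1.html, ...
    public static void download(List<String> urls, Path folder) throws Exception {
        Files.createDirectories(folder);
        int i = 0;
        for (String u : urls) {
            try (InputStream in = new URL(u).openStream()) {
                Files.copy(in, folder.resolve("page_" + (i++) + ".html"),
                           StandardCopyOption.REPLACE_EXISTING);
            } catch (Exception e) {
                System.err.println("Skipping " + u + ": " + e.getMessage()); // unreachable page
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // In iSoS the URLs would be extracted from the result pages of the generator query.
        download(List.of("http://www.example.com/"), Paths.get("corpus/apple"));
    }
}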
2.2 Indexing
The main aim of the indexing stage is to store statis-
tics about terms to make their search more efficient.
A preliminary document analysis is needed in order to recognize tags, metadata and informative content: this step is often referred to as parsing. A standard Lucene index is made of a document sequence where each document, represented by an integer number, is a sequence of fields, with every field containing index terms; such an index belongs to the inverted index family because it can list, for a term, the documents that contain it. Correct parsing helps to build well-categorized field sets, improving subsequent searching and scoring. There are different approaches to web page indexing. For example, some engines do not index whole words but only their stems. The stemming process reduces inflected words to their root form and is a very common element in query systems such as Web search engines, since words with the same root are supposed to carry similar informative content. In order to avoid indexing common words such as prepositions, conjunctions and articles, which do not bring any additional information, stopword filtering can also be adopted. Since the indexing of stemmed words and stopword filtering often result in a loss of search precision, although they can help reduce the index size, they are not the choice of major search engines (such as Google). For this application, we developed a custom Lucene analyzer which indexes both words and their stems without stopword filtering; it is thus possible to include in the searching process ontologies made of stemmed words, and so to optimize ontology-based search without penalizing the precision of the original query.
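The sketch below illustrates, under simplifying assumptions, the idea of indexing both surface forms and stems in a single inverted index; it is not the actual custom Lucene analyzer, and the suffix-stripping stem() method is a deliberately naive placeholder for a real stemming algorithm.

import java.util.*;

public class DualFormIndexer {

    // term -> identifiers of the documents containing it (a tiny inverted index)
    private final Map<String, Set<Integer>> index = new HashMap<>();

    // Naive suffix stripping: a real system would use a proper stemming algorithm.
    private static String stem(String word) {
        return word.replaceAll("(ing|ed|es|s)$", "");
    }

    // Index both the original token and its stem, with no stopword filtering.
    public void addDocument(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            index.computeIfAbsent(token, k -> new HashSet<>()).add(docId);
            index.computeIfAbsent(stem(token), k -> new HashSet<>()).add(docId);
        }
    }

    public Set<Integer> postings(String term) {
        return index.getOrDefault(term, Collections.emptySet());
    }

    public static void main(String[] args) {
        DualFormIndexer idx = new DualFormIndexer();
        idx.addDocument(0, "Apples are popular fruits");
        idx.addDocument(1, "A fruit tree in the orchard");
        System.out.println(idx.postings("fruits")); // exact form: only document 0
        System.out.println(idx.postings("fruit"));  // the stem reaches both documents
    }
}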
2.3 Searching and Scoring
The heart of a search engine lies in its ability to rank-order the documents matching a query. This is done through specific score computations and ranking policies. Several information retrieval (IR) operations (including scoring documents against a query, document classification and clustering) rely on the vector space model, in which documents are represented as vectors in a common vector space (Manning et al., 2008).
Table 1: An example of ontology for the topic Apple.
Word 1 Word 2 Relation factor
fruit popular 0.539597
fruit juic 0.530942
fruit mani 0.539458
fruit seed 0.531137
orchard popular 0.530548
cider popular 0.531391
mani tree 0.535011
mani fruit 0.539458
For every document d, a vector $\vec{V}(d)$ can be considered, with a component for each dictionary term computed through tf-idf weighting. The tf-idf scheme assigns to a term t in a document d the weight $\text{tf-idf}_{t,d} = \text{tf}_{t,d} \times \text{idf}_t$, where $\text{tf}_{t,d}$ is the term frequency and $\text{idf}_t = \log \frac{N}{\text{df}_t}$ is the inverse document frequency, N being the total number of documents and $\text{df}_t$ the number of documents containing the term t. In this model a query q can also be represented as a vector $\vec{V}(q)$, allowing a document d to be scored by computing the following cosine similarity:
$$\text{score}(q,d) = \frac{\vec{V}(q) \cdot \vec{V}(d)}{|\vec{V}(q)|\,|\vec{V}(d)|},$$
where the denominator is the product of the Euclidean lengths and compensates for the effect of document length. Lucene embeds very efficient searching and scoring algorithms based on this model and on the Boolean model. In order to perform ontology-based search, we customized the querying mechanism to fit our needs. Since our ontologies are represented as pairs of related words whose relationship strength is described by a real value (the relation factor), we used the Lucene Boolean model and term boosting to extend the base query with ontology contributions.
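As an illustration of the scoring scheme just described, the following sketch computes tf-idf vectors and cosine scores for a toy corpus; it is a didactic re-implementation of the vector space model, not the Lucene scoring code, and the example documents and query are invented.

import java.util.*;

public class VectorSpaceScorer {

    // Returns, for each document (a list of tokens), a sparse map term -> tf-idf weight.
    static List<Map<String, Double>> tfIdfVectors(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> d : docs)
            for (String t : new HashSet<>(d)) df.merge(t, 1, Integer::sum);   // document frequency
        int n = docs.size();
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> d : docs) {
            Map<String, Double> v = new HashMap<>();
            for (String t : d) v.merge(t, 1.0, Double::sum);                   // raw term frequency
            v.replaceAll((t, tf) -> tf * Math.log((double) n / df.get(t)));    // tf * idf
            vectors.add(v);
        }
        return vectors;
    }

    // Cosine similarity between two sparse vectors.
    static double cosine(Map<String, Double> q, Map<String, Double> d) {
        double dot = 0, nq = 0, nd = 0;
        for (Map.Entry<String, Double> e : q.entrySet())
            dot += e.getValue() * d.getOrDefault(e.getKey(), 0.0);
        for (double w : q.values()) nq += w * w;
        for (double w : d.values()) nd += w * w;
        return (nq == 0 || nd == 0) ? 0 : dot / (Math.sqrt(nq) * Math.sqrt(nd));
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
            List.of("apple", "fruit", "juice"),
            List.of("apple", "computer", "software"));
        List<Map<String, Double>> v = tfIdfVectors(docs);
        Map<String, Double> query = Map.of("fruit", 1.0, "juice", 1.0);
        // The first document matches the query, the second does not.
        for (int i = 0; i < docs.size(); i++)
            System.out.printf("score(q, d%d) = %.3f%n", i, cosine(query, v.get(i)));
    }
}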
Table 1 shows an example of ontology represen-
tation for the topic Apple; including such an ontology
in the searching process would basically result in per-
forming:
((fruit AND popular)^0.539597) OR
((fruit AND juic)^0.530942)...
that is, searching for the word pair fruit AND popular with a boost factor of 0.539597, OR the word pair fruit AND juic with a boost factor of 0.530942, and so on. The default boost factor is 1 and is used for the original user query. As discussed before, the way documents are indexed allows queries to be performed with either full or stemmed words; ontologies generated by the ontology builder tool, which will be discussed in the next section, are always made of stemmed words so as to organize information in a compact fashion and avoid redundancy.
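The extended query can be assembled as a plain query string in standard Lucene syntax (AND, OR and the ^ boost operator), as in the example above. The following sketch shows one possible way of building such a string from a list of ontology pairs; the class and method names are only illustrative, while the pair values are those of Table 1.

import java.util.*;

public class OntologyQueryBuilder {

    // A single ontology entry: two related (stemmed) words and their relation factor.
    record Pair(String w1, String w2, double factor) {}

    // Extends the user query (default boost 1) with one boosted clause per ontology pair.
    static String extend(String userQuery, List<Pair> ontology) {
        StringBuilder sb = new StringBuilder("(" + userQuery + ")");
        for (Pair p : ontology)
            sb.append(" OR ((").append(p.w1()).append(" AND ").append(p.w2())
              .append(")^").append(p.factor()).append(")");
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Pair> ontology = List.of(
            new Pair("fruit", "popular", 0.539597),
            new Pair("fruit", "juic", 0.530942));
        // prints: (apple) OR ((fruit AND popular)^0.539597) OR ((fruit AND juic)^0.530942)
        System.out.println(extend("apple", ontology));
    }
}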
2.4 Ontology Builder
The Ontology builder is an automatic tool for the construction of ontologies based on an extension of the probabilistic topic model introduced in (Griffiths et al., 2007) and (Blei et al., 2003). The method has been illustrated in depth in (Colace et al., 2008); here we summarize the main idea behind it. The original theory asserts a semantic representation in which word meanings are represented in terms of a set of probabilistic topics $z_i$, under the assumptions of statistical independence among the words $w_i$ and of the "bag of words" model. The "bag of words" assumption claims that a document can be considered as a feature vector where each element indicates the presence (or absence) of a word, while information on the position of that word within the document is completely lost. This model is generative and allows several problems to be solved, including the word association problem, which is fundamental to the automatic ontology building method. Such a problem was originally studied to demonstrate the role that the associative semantic structure of words plays in episodic memory. In the topic model, word association can be thought of as a problem of prediction: given that a cue is presented, which new words might occur next in that context? By analyzing these associations we can infer semantic relations among words; moreover, by applying this method to the automatic interpretation of a document, we can infer all the semantic relations among the words contained in that document and thus obtain a new representation of the document: what we call an ontology.
We write $P(z)$ for the distribution over topics z in a particular document and $P(w|z)$ for the probability distribution over words w given topic z. Each word $w_i$ in a document (where the index refers to the i-th word token) is generated by first sampling a topic from the topic distribution and then choosing a word from the topic-word distribution. We write $P(z_i = j)$ for the probability that the j-th topic was sampled for the i-th word token, which indicates which topics are important for a particular document. Moreover, we write $P(w_i|z_i = j)$ for the probability of word $w_i$ under topic j, which indicates which words are important for which topic. The model specifies the following distribution over words within a document: $P(w_i) = \sum_{k=1}^{T} P(w_i|z_i = k)P(z_i = k)$, where T is the number of topics.
Figure 1: iSoS GUI screenshot.

Through the topic model we can build consistent relations between words by measuring their degree of dependence, formally by computing the joint probability between words, $P(w_i, w_j) = \sum_{k=1}^{T} P(w_i|z_i = k)P(w_j|z_j = k)$.
In this model, the multinomial distribution representing the gist is drawn from a Dirichlet distribution, a standard probability distribution over multinomials. The results of the LDA algorithm (Blei et al., 2003), obtained by running Gibbs sampling, are two matrices:
1. the words-topics matrix Φ, which contains the probability that word w is assigned to topic j;
2. the topics-documents matrix Θ, which contains the probability that topic j is assigned to some word token within a document.
By comparing the joint probability with the probability of each random variable we can establish how statistically dependent two variables (words) are; in fact the strength of such statistical dependence increases as the mutual information measure increases, namely $\rho_{ij} = \log \frac{P(w_i,w_j)}{P(w_i)P(w_j)}$, where $\rho_{ij} \in [0,1]$ after a normalization procedure. By selecting the strongest connections among all the existing ones, for instance by choosing a threshold for the mutual information measure, a graph over the words can be derived. As a consequence, an ontology can be considered as a set of word pairs, each of them having its own mutual information value (see Table 1).
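A compact numerical sketch of this construction is given below: starting from a toy word-topic matrix (rows of Φ) and the topic weights of one document (a column of Θ), it computes the marginal and joint word probabilities, the dependence measure ρ normalized by its maximum value (one possible normalization choice), and the thresholded word pairs that form the ontology. The matrices are small hand-made examples, not the output of a real LDA run.

import java.util.*;

public class OntologyBuilderSketch {
    public static void main(String[] args) {
        String[] words = {"fruit", "popular", "cider", "softwar"};
        // phi[k][i] = P(w_i | z = k): word-topic probabilities (toy values).
        double[][] phi = {
            {0.50, 0.25, 0.25, 0.00},   // topic 0, e.g. "apple as a fruit"
            {0.10, 0.10, 0.00, 0.80}};  // topic 1, e.g. "apple as a company"
        // theta[k] = P(z = k): topic weights for the document under analysis.
        double[] theta = {0.7, 0.3};
        int T = theta.length, V = words.length;

        // P(w_i) = sum_k P(w_i | z = k) P(z = k)
        double[] pw = new double[V];
        for (int i = 0; i < V; i++)
            for (int k = 0; k < T; k++) pw[i] += phi[k][i] * theta[k];

        // P(w_i, w_j) = sum_k P(w_i | z = k) P(w_j | z = k), then
        // rho_ij = log(P(w_i, w_j) / (P(w_i) P(w_j))), normalized here by its maximum.
        double[][] rho = new double[V][V];
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < V; i++)
            for (int j = i + 1; j < V; j++) {
                double joint = 0;
                for (int k = 0; k < T; k++) joint += phi[k][i] * phi[k][j];
                if (joint > 0 && pw[i] > 0 && pw[j] > 0) {
                    rho[i][j] = Math.log(joint / (pw[i] * pw[j]));
                    max = Math.max(max, rho[i][j]);
                }
            }

        // Keep only the pairs whose normalized dependence exceeds a threshold: the ontology.
        double threshold = 0.5;
        for (int i = 0; i < V; i++)
            for (int j = i + 1; j < V; j++) {
                double relation = rho[i][j] / max;
                if (rho[i][j] > 0 && relation >= threshold)
                    System.out.printf("%s  %s  %.4f%n", words[i], words[j], relation);
            }
    }
}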
3 iSoS IN PRACTICE
Figure 1 shows a screenshot of the iSoS home page. The GUI looks very simple, with few essential elements: on the top left there is the main logo and, below it, the input query section, where the searching context is also shown. The searching context refers to the set of documents selected during the crawling stage for a specific generator query. On the top right there are the language selection and user management modules. This version allows the choice of only two languages, English and Italian; different stemming algorithms and stopword sets are used depending on the chosen language. User registration can be done through a specific form and is required to create, store and recall user ontologies created with the ontology builder tool. The lateral panel is organized in three sections:
Ontology management. This section allows the selection of one or more ontologies from the global set or from the user's customized set, which can be added to or removed from the searching process by clicking on the add or remove buttons respectively. The option Create a new ontology allows registered users to launch the Ontology Builder tool, which will be discussed in the next subsection.
Ontology display. The show button allows ontologies to be displayed in their native tabular form. It is also possible to show their graphical representation by clicking on Show ontology graph.
Advanced settings. By clicking on advanced settings, other fields are displayed: the set folder option allows the searching context to be changed to another available one; the create new index option allows a new index to be created by specifying the generator query, the number of Google search pages and the number of results per page. The default context is the last generated one.
Results, shown in the central part of the screen,
include URL links and Lucene scores.
3.1 Ontology Builder in Practice
By using this method for building ontologies we are able to capture the main topics treated in a document and provide a unique structure for each document; for more details on the main properties of this method see (Colace et al., 2008). A user could declare his intentions by writing a text, even a few lines long, and thanks to this automatic method we can extract the core of those intentions. Alternatively, one could build an ontology by providing a text that discusses the topics he is interested in. At this point, whatever the origin of the ontology, it can be used to resolve informational queries for a given context. We demonstrate below that the use of this technique significantly improves the quality, in terms of relevance, of the results obtained by our search engine.
This tool allows registered users to create their own customized ontologies by simply uploading a document set they choose to describe a given topic. Figure 2 shows the Ontology builder GUI; it looks very similar to the iSoS main page, with a lateral panel organized in three sections:
Figure 2: iSoS Ontology builder screenshot.
Documents uploading. It’s possible for users to
import single documents (txt, HTML pages) or a
zip archive containing the document set.
Ontology settings. The user can choose a thresh-
old for the relation factor and build the corre-
sponding ontology by clicking on the build but-
ton. Once the ontology has been built for the first
time, the user can vary the number of nodes by
selecting a different threshold and by clicking on
the rebuild button.
Ontology saving. The user can save the cus-
tomized ontology by choosing a name and click-
ing on the Save button. Stored ontologies are then
available in the ontology management section to
be included in the searching process.
4 PERFORMANCE EVALUATION
In order to evaluate search engine performance, different measures can be taken into account: query response time, database coverage, index freshness and availability of the web pages as time passes (Bar-Ilan, 2004), user effort, and retrieval effectiveness. Since the main aim of this paper is to point out improvements in retrieval effectiveness due to the introduction of ontologies into the searching process, we concentrate only on aspects regarding the relevance of the top documents retrieved. When testing a particular search engine, it can be useful to make comparisons with the behaviour of different engines in order to highlight the strengths and weaknesses of the system under test. In this study, a first evaluation stage has been conducted by comparing iSoS behaviour with a Google Custom Search Engine (CSE); the reason for using Google CSE instead of standard Google relates to the limited crawling capabilities of iSoS: we had to ensure that both engines performed searches on the same corpus of web pages, and Google CSE allows the list of URLs of the Web pages to be considered in the searching process to be specified. To evaluate retrieval effectiveness, the most commonly used criteria are precision and recall (Manning et al., 2008): the former is defined as the fraction of retrieved documents that are relevant, the latter as the fraction of relevant documents that are retrieved. The huge and ever-changing domain of Web systems makes it impossible to calculate true recall, which would require knowledge of the total number of relevant items in the collection. Thus recall calculation is often approximated or even omitted (Chu and Rosenthal, 1996) when comparing search engine performance. In this paper we use a modified recall evaluation, introduced by Vaughan (Vaughan, 2004) in a previous work, which relies on human judgement. Unfortunately, as is well known, precision and recall measures do not take into account the subjective relevance of the retrieved documents. Indeed, previous studies have emphasized that human subjects can make relevance judgements on a continuous scale (Greisdorf and Spink, 2001), so we found it useful to carry out a second evaluation stage relying on continuous human ranking, which can be seen as the ideal ranking reference and provides a better term of comparison between the systems being evaluated. In the following sections we describe how the experimental stage was carried out.
4.1 Topics and Search Queries Selection
When human relevance judgement is involved, a large variety of factors can bias the results, as the concept of relevance is very subjective. Previous studies have emphasized that relevance judgements can only be made by people who have the original information needs (Gordon and Pathak, 1999), so the selection of topics and search queries should involve the people who make the judgement. Since fifty people were involved in the judging task, five groups of ten people were formed, and each group defined a common information need in Italian. As a result, five topics and related queries were designed:
1. Topic: Arte Rinascimentale (Italian for Renaissance art).
Query: Arte Rinascimentale.
Query referred to later in this paper as AR.
2. Topic: Evoluzione della lingua italiana (Italian for Evolution of the Italian language).
Query: Evoluzione della lingua italiana.
Query referred to later in this paper as ELI.
3. Topic: Storia del teatro napoletano (Italian for History of Neapolitan theatre).
Query: Storia del teatro napoletano.
Query referred to later in this paper as STN.
4. Topic: Storia dell'opera italiana (Italian for History of Italian opera).
Query: Storia dell'opera italiana.
Query referred to later in this paper as OPI.
5. Topic: Origini della mozzarella (Italian for Origins of buffalo mozzarella).
Query: Origini della mozzarella di bufala.
Query referred to later in this paper as OMB.
For each topic, three hundred pages were downloaded from the Web to be indexed by the iSoS search engine, and their URLs were used to program a Google Custom Search Engine, allowing both engines to perform searches on the same corpus of documents.
4.2 Ontology Building and Web Pages Retrieved
In order to feed the ontology builder and produce the ontologies for the experiments, we asked five people with great skill or knowledge in the chosen topics to provide a set of documents best describing the information needs taken into account. Once the ontologies had been built, we performed each query on both iSoS and Google CSE. To better evaluate the ontology contribution, we also performed each query on iSoS without including any ontology but with a query extension: we simply added the ontology terms to the query. Then, for each query we obtained 30 pages, corresponding to the top 10 pages retrieved by each engine, which were merged to form the set of pages to be ranked by human subjects for that particular topic. As a result of the merge, the number of pages in the sets AR, OPI, ELI, STN and OMB were 17, 19, 20, 19 and 17 respectively.
Figure 3: OPI Borda Count results before meeting.
4.3 Human Ranking of the Web Pages
Subjects in the study were graduates in various disciplines, ranging from engineering (information, electronic, management) to literature and economics, who were involved in a university training course. People were divided into five groups of ten people each. Each group evaluated the set of documents related to its own query. One must consider that the need behind a web search may be not only informational but also navigational (searching for the URL of a specific site) or transactional (searching for e-commerce sites, etc.).
Figure 4: OPI Borda Count results after meeting.
Due to the kind of ontologies produced, when deciding on relevance the subjects were asked to judge positively only informational Web pages, without following any link on the page. Each subject had to rank the pages according to his/her criteria and write those criteria down. After that, each group met to discuss their rankings and criteria in order to improve ranking quality and reduce the effect of unusual rankings by individual subjects. Finally, they were allowed to adjust their personal rankings if necessary. In order to derive an average human behaviour, the Borda Count method (Saari, 2001) was applied to both the before-meeting and the after-meeting results. Figure 3 shows the OPI human-ranked results before the meeting, together with a graphical representation of the Borda count outcomes. Such a representation provides a simple way of investigating the subjects' degree of agreement for each result, with black being the maximum (all the subjects agreed on the ranking) and white being the minimum (nobody voted for that position). Comparing these results with the ones shown in Figure 4, which refers to the after-meeting case, more dark boxes lying closer to the diagonal line of the graph can be seen, showing a better agreement on the final ranking, as expected. Therefore, for all the queries we took as the human behaviour the results coming from the Borda Count method applied to the after-meeting case. We decided to limit attention to the top 10 pages because users typically visit only the top pages retrieved (Silverstein et al., 1999). Tables 2, 3, 4, 5 and 6 show the top 10 human-ranked results for each query, highlighting the iSoS and Google CSE positioning for each page. Complete URLs of the pages are not reported due to space limitations.
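For completeness, the following is a minimal sketch of Borda count aggregation as used here to derive the average human ranking: each subject's ranking assigns n-1 points to the first of n pages, down to 0 points for the last, and the points are summed over subjects. The page identifiers are hypothetical.

import java.util.*;

public class BordaCount {

    // Each ranking is a list of page ids, best first; returns pages ordered by total Borda score.
    static List<String> aggregate(List<List<String>> rankings) {
        Map<String, Integer> score = new HashMap<>();
        for (List<String> ranking : rankings) {
            int n = ranking.size();
            for (int pos = 0; pos < n; pos++)
                score.merge(ranking.get(pos), n - 1 - pos, Integer::sum); // n-1 points for rank 1
        }
        List<String> result = new ArrayList<>(score.keySet());
        result.sort((a, b) -> score.get(b) - score.get(a)); // highest total first
        return result;
    }

    public static void main(String[] args) {
        // Three subjects ranking the same four pages (toy example).
        List<List<String>> rankings = List.of(
            List.of("pageA", "pageB", "pageC", "pageD"),
            List.of("pageB", "pageA", "pageC", "pageD"),
            List.of("pageA", "pageC", "pageB", "pageD"));
        System.out.println(aggregate(rankings)); // [pageA, pageB, pageC, pageD]
    }
}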
Table 2: Results obtained for the query AR.
Human ranked URLs iSoS Google
1 www.artistiinrete.it 3 5
2 www.bilanciozero.net 2 > 10
3 it.encarta.msn.com 5 3
4 www.firenze-online.com 1 > 10
5 it.wikipedia.org 4 1
6 www.arte.go.it 8 > 10
7 digilander.libero.it 9 > 10
8 www.visibilmente.it 6 7
9 www.salviani.it > 10 4
10 www.arte-argomenti.org 7 > 10
Table 3: Results obtained for the query ELI.
Human ranked URLs iSoS Google
1 blogs.dotnethell.it 3 > 10
2 it.wikipedia.org 1 1
3 www.letteratour.it 2 > 10
4 www.nonsoloscuola.net 4 > 10
5 digilander.libero.it 7 > 10
6 xoomer.virgilio.it 6 > 10
7 www.etx.it 8 > 10
8 www.regione.emilia-romagna.it > 10 10
9 www.tesionline.com > 10 6
10 www.tesionline.it > 10 3
4.4 Precision and Recall Evaluation
Precision is a measure always present in formal information retrieval problems. As described before, it is defined as the fraction of retrieved documents that are relevant, so it depends on how the relevance judgements were made. Binary relevance judgements are often used, also in TREC experiments: a document is either relevant to the topic or it is not (Voorhees, 2003). Several studies have used multi-level rather than binary relevance judgements, but all these kinds of discrete relevance scores suffer from the possibility of giving the same score to different documents, so they are not very useful when evaluating ranking results. However, in this study we use a variant of the standard precision-recall evaluation based on the human-ranked results; in particular, we consider the top 10 human-ranked results as relevant, and this assumption allows us not to worry about the real number of relevant documents in the corpus, which would be very hard to calculate precisely. One must consider that our target is to compare iSoS and Google CSE behaviours with the human one, so this method appears to be very effective, although approximate. For each set of results, precision and recall values have been plotted to give an interpolated precision-recall curve.
Table 4: Results obtained for the query OPI.
Human ranked URLs iSoS Google
1 it.wikipedia.org 4 10
2 www.jazzplayer.it 2 > 10
3 musicallround.forumcommunity.net 3 > 10
4 www.sonorika.com 5 > 10
5 www.sapere.it 6 > 10
6 www.bookonline.it 1 > 10
7 www.ilpaesedeibambinichesorridono.it 10 > 10
8 www.gremus.it > 10 1
9 www.bulgaria-italia.com 8 > 10
10 www.rodoni.ch 9 > 10
Table 5: Results obtained for the query STN.
Human ranked URLs iSoS Google
1 www.sottoilvesuvio.it 1 7
2 www.blackwikipedia.org 3 > 10
3 it.wikipedia.org/wiki 2 1
4 xoomer.virgilio.it 5 5
5 www.webalice.it 6 > 10
6 www.laboriosi.it 8 > 10
7 www.denaro.it > 10 4
8 www.gttempo.it 7 > 10
9 www.teatroantico.org 10 > 10
10 azzurrocomenapoli.myblog.it 4 > 10
The interpolated precision at a certain recall level r is defined as the highest precision found for any recall level $r' \geq r$: $p_{\text{interp}}(r) = \max_{r' \geq r} p(r')$. The traditional way of representing this is the 11-point interpolated average precision: for each information need, the interpolated precision is measured at the 11 recall levels 0.0, 0.1, 0.2, ..., 1.0; for each recall level, we then calculate the arithmetic mean of the interpolated precision at that level over all the information needs in the corpus. A composite precision-recall curve showing 11 points can then be graphed.
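The following sketch summarizes this computation for a single query: given a ranked result list and the set of pages considered relevant (here, the top 10 human-ranked ones), it derives the precision/recall points and interpolates them at the 11 standard recall levels. The document identifiers are invented.

import java.util.*;

public class InterpolatedPrecision {

    // Interpolated precision at the 11 recall levels 0.0, 0.1, ..., 1.0 for one ranked list.
    static double[] elevenPoint(List<String> ranked, Set<String> relevant) {
        int found = 0;
        List<double[]> points = new ArrayList<>(); // {recall, precision} after each retrieved doc
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) found++;
            points.add(new double[]{(double) found / relevant.size(), (double) found / (i + 1)});
        }
        double[] interp = new double[11];
        for (int level = 0; level <= 10; level++) {
            double r = level / 10.0, best = 0;
            for (double[] p : points)
                if (p[0] >= r) best = Math.max(best, p[1]); // p_interp(r) = max precision at recall >= r
            interp[level] = best;
        }
        return interp;
    }

    public static void main(String[] args) {
        List<String> ranked = List.of("d1", "d2", "d3", "d4", "d5");
        Set<String> relevant = Set.of("d1", "d3", "d5");
        System.out.println(Arrays.toString(elevenPoint(ranked, relevant)));
    }
}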
Figures 5, 6, 7, 8, 9 show precision-recall interpolated
graphs for the queries AR, ELI, OPI, OMB, STN re-
spectively. It can be seen that for all the examined
cases iSoS displays a better performance than Google
CSE, with OPI being the best case. However, as said
before, precision and recall measurements rely on bi-
nary relevancy judgements so we don’t find them ac-
curate enough to make a good comparison. In the next
section, we propose an alternative method to com-
pare iSoS and Google CSE behaviours with the hu-
man one.
Figure 5: Precision - Recall for AR query.
Figure 6: Precision - Recall for ELI query.
Figure 7: Precision - Recall for OPI query.
Figure 8: Precision - Recall for OMB query.
Figure 9: Precision - Recall for STN query.
Table 6: Results obtained for the query OMB.
Human ranked URLs iSoS Google
1 magazine.paginemediche.it 6 > 10
2 www.caseificioesposito.it 1 > 10
3 www.agricultura.it 2 > 10
4 it.wikipedia.org 3 2
5 www.mozzarelladibufala.org > 10 1
6 www.ciboviaggiando.it 9 6
7 www.sito.regione.campania.it 8 5
8 www.aversalenostreradici.com 4 > 10
9 www.tenutadoria.it 7 7
10 www.bortonevivai.it 5 8
4.5 Relevance Evaluation
A better performance evaluation can be done by comparing the iSoS and Google CSE results with the human ranking, which we consider the reference behaviour. Figure 10 provides a graphical representation of the iSoS, Google CSE and human being (HB) behaviours for the query AR; being the reference, the HB behaviour is displayed as a blue line, while the iSoS and Google CSE results are displayed as red and green curves respectively. This representation allows a quick visual comparison between the different behaviours: the closer a curve lies to the blue line, the closer the search engine represented by that curve is to the ideal behaviour. The graph region to be considered extends to the tenth result, as said before. The numbers next to the points on the blue line represent the mean square error evaluated from the subjects' preferences for that position. Another useful representation is shown in Figure 11 and displays the distance of the HB ranking from the iSoS ranking (red bars) and from the Google CSE ranking (green bars). This kind of analysis has also been conducted for the other queries and is shown in Figures 12 to 19. It can be seen that iSoS exhibits a better average behaviour in all cases, with OPI being the best one. Although these outcomes seem close to the precision-recall ones, it must be said that this method also helps to better highlight the weaknesses of iSoS, which can be seen, for example, in the last three results for the query ELI (Figure 13).
5 CONCLUSIONS
In this work we have seen how current web search engines are not able to resolve informational queries in an appropriate way. We have also shown how the results of a classic search engine can be improved through the use of an innovative search technique based on a particular kind of ontology. This ontology is obtained using a technique based on a model of language, the probabilistic topic model, which builds on the LDA technique, now well established in the Information Retrieval community. An ontology of this type is able to clearly capture the main topic of interest, so that, when used in a simple informational query task, it can certainly improve the quality of the results returned. The proposed technique was validated through a comparison with a classic search engine and through a comparison with a measure of relevance obtained from experiments with human beings. The experiments were conducted for different contexts and, for each of them, different groups of human beings were asked to assign relevance judgements to the set of web pages collected by merging the results obtained with our search engine and with the classic one. The results obtained have confirmed that the proposed technique certainly increases the relevance of the results, which therefore better respect the user intentions. We have also discussed the opportunity of developing new techniques for manipulating language, mainly based on probabilistic models, and of using those techniques to handle semantic information. This is an important point of our discussion, because an enormous quantity of information is currently on the web, but few tools are able to retrieve, within this huge set of information, something that really respects user intentions. To this end, the Semantic Web community is working towards the introduction of new technologies and tools to make the enormous amount of information on the web manageable. However, this technological paradigm shift has not happened yet and does not appear imminent, so in the meantime tools such as those treated in this work can be really useful.

Figure 10: AR comparison.
Figure 11: AR comparison.
Figure 12: ELI comparison.
Figure 13: ELI comparison.
Figure 14: OPI comparison.
Figure 15: OPI comparison.
Figure 16: OMB comparison.
Figure 17: OMB comparison.
Figure 18: STN comparison.
Figure 19: STN comparison.
REFERENCES
Bar-Ilan, J. (2004). Methods for measuring search engine performance over time. Journal of the American Society for Information Science and Technology, 53:308–319.
Berners-Lee, T., Hendler, J., and Lassila, O. (2001). The semantic web. Scientific American, May.
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.
Brin, S. (1998). The anatomy of a large-scale hypertextual web search engine. In Computer Networks and ISDN Systems, pages 107–117.
Chu, H. and Rosenthal, M. (1996). Search engines for the World Wide Web: a comparative study and evaluation methodology. In Proceedings of the 59th Annual Meeting of the American Society for Information Science, pages 127–135.
Colace, F., Santo, M. D., and Napoletano, P. (2008). A note on methodology for designing ontology management systems. In AAAI Spring Symposium.
Gordon, M. and Pathak, P. (1999). Finding information on the World Wide Web: the retrieval effectiveness of search engines. Information Processing and Management, 35:141–180.
Greisdorf, H. and Spink, A. (2001). Median measure: an approach to IR systems evaluation. Information Processing and Management, 37(6):843–857.
Griffiths, T. L., Steyvers, M., and Tenenbaum, J. B. (2007). Topics in semantic representation. Psychological Review, 114(2):211–244.
Hofmann, T. (1999). Probabilistic latent semantic indexing. In Proceedings of the Twenty-Second Annual International SIGIR Conference.
Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
Manning, C. D. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, USA.
Saari, D. G. (2001). Chaotic Elections! A Mathematician Looks at Voting. American Mathematical Society, Providence.
Silverstein, C., Marais, H., Henzinger, M., and Moricz, M. (1999). Analysis of a very large web search engine query log. ACM SIGIR Forum, 33(1):6–12.
Vaughan, L. (2004). New measurements for search engine evaluation. Information Processing and Management, 40:677–691.
Voorhees, E. M. (2003). Overview of TREC 2003. In Proceedings of the 12th Text REtrieval Conference, pages 1–13.