XHITS: LEARNING TO RANK IN A HYPERLINKED STRUCTURE

Francisco Benjamim Filho, Raúl Pierre Rentería and Ruy Luiz Milidiú

Department of Computing, Pontifícia Universidade Católica do Rio de Janeiro

Rua Marquês de São Vicente, 225, Rio de Janeiro, Brazil

Keywords:

Search engines, Keyword-based ranking, Link-based ranking.

Abstract:

The explosive growth and widespread accessibility of the Web have led to a surge of research activity in the area of information retrieval on the WWW. This is a huge and rich environment in which web pages can be viewed as a large community of elements connected through links created for a variety of reasons. The HITS approach introduces two basic concepts, hubs and authorities, which reveal some hidden semantic information from the links. In this paper, we review XHITS, a generalization of HITS that expands the model from two to several concepts, and present a new Machine Learning algorithm to calibrate an XHITS model. The new learning algorithm uses latent feature concepts. Furthermore, we provide some illustrative examples and empirical tests. Our findings indicate that the new learning approach provides a more accurate XHITS model.

1 INTRODUCTION

Classification plays a vital role in many information management and retrieval tasks. On the Web, the link structure provides valuable information that can be used to improve information retrieval quality (Borodin et al., 2001), (Chakrabarti et al., 2001), (Lempel and Moran, 2001), (Ding et al., 2002a).

There are many different proposals for searching and ranking information on the WWW (Mendelzon and Rafiei, 2000), (Cohn and Chang, 2000), (Giles et al., 2000), (yu Kao et al., 2003), (Fowler and Karadayi, 2002), (Ding et al., 2002b), (Agosti and Pretto, 2005), (Mizzaro and Robertson, 2007), (Lempel and Moran, 2001). Some proposals improve the quality of existing ones by incorporating user behavior data (Agichtein et al., 2006), (Craswell and Szummer, 2007).

In a seminal paper (Kleinberg, 1999), Jon Kleinberg introduced the notion of two fundamental categories of web pages: authorities and hubs. These categories have a mutual reinforcement relationship; to break this circularity and classify the pages, Kleinberg proposed the HITS algorithm.

The XHITS approach (Filho, 2005), (Filho et al., 2009) extends Kleinberg's model: it introduces new page categories and captures more individual judgment information from the hyperlinked environment, improving the page ranking.

With the inclusion of new categories, several parameters emerged in the model that need to be calibrated. A learning process that uses the gradient descent method was previously proposed to calibrate these parameters (Filho et al., 2009).

This paper focuses on the learning process, proposing a novel algorithm and comparing it to the previous one. To this end, we introduce a new objective function with two major components: one is the error due to the Singular Value Decomposition (SVD) of the influence matrix, and the other is the average ranking error rescaled through a sigmoid function. With this new approach, we improved the ranking quality.

This paper is structured as follows. In section 2, we summarize the XHITS modeling approach. In section 3, we introduce an SVD Machine Learning procedure to calibrate the model. Next, in section 4, we examine the empirical behaviour of the SVD Learning Process (SVDLP) and compare it with the Approximate Learning Process (ALP). Finally, in section 5, we state some interesting consequences of our results and draw our conclusions.

2 PAGE ROLE CLASSIFICATION

In this section, we summarize the XHITS algorithm proposed in (Filho et al., 2009), extracting the main structures and definitions necessary to understand the present paper.


Benjamim Filho F., Pierre Renteria R. and Luiz Milidiú R..

XHITS: LEARNING TO RANK IN A HYPERLINKED STRUCTURE.

DOI: 10.5220/0003632503770381

In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR-2011), pages 377-381

ISBN: 978-989-8425-79-9

 2011 SCITEPRESS (Science and Technology Publications, Lda.)

The XHITS model extends the Hubs and Authorities model by introducing k new categories, which are represented in each page by k additional weights. These weights are reinforced through the links, and there are both forward and backward influences. Furthermore, this approach presents the extended model in matrix form and uses a learning process to calibrate the influence matrix $M_\sigma$, as can be seen in equation (1), where σ represents the query, B is the backward influence matrix, F is the forward influence matrix and $A_\sigma$ is the adjacency matrix of the web graph for the query σ.

The influence matrix reveals the combination of the two sources of mutual influence: link propagation and category reinforcement. In the special case of symmetric reinforcement, the matrix $M_\sigma$ is symmetric and the Power Method can be used to find the largest eigenvalue of $M_\sigma$ and a corresponding eigenvector. After sorting the entries of the eigenvector, we obtain the rank of the pages and can analyze the quality of this rank.

$$M_\sigma = (B \otimes A_\sigma^{T}) + (F \otimes A_\sigma) \qquad (1)$$
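To make equation (1) concrete, the following is a minimal Python sketch that builds an influence matrix from small, made-up inputs and applies the Power Method to it. The helper names (`kron`, `influence_matrix`, `power_method`), the three-page graph and the values of `F` are illustrative assumptions, not taken from the XHITS papers. The sketch also assumes that the backward term uses the transposed adjacency matrix, which is the reading under which B equal to the transpose of F makes the influence matrix symmetric.

```python
def kron(X, Y):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[x * y for x in x_row for y in y_row]
            for x_row in X for y_row in Y]


def transpose(X):
    return [list(col) for col in zip(*X)]


def influence_matrix(B, F, A):
    """M = (B kron A^T) + (F kron A); assumed reading of equation (1)."""
    backward = kron(B, transpose(A))
    forward = kron(F, A)
    return [[x + y for x, y in zip(b_row, f_row)]
            for b_row, f_row in zip(backward, forward)]


def power_method(M, iterations=200):
    """Return estimates of the dominant eigenvalue and its eigenvector.

    The vector is rescaled to unit L1 norm after every multiplication.
    """
    n = len(M)
    v = [1.0 / n] * n
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(abs(x) for x in w)
        v = [x / eigenvalue for x in w]
    return eigenvalue, v


# Hypothetical 3-page web graph: A[i][j] = 1 if page i links to page j.
A = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
# Hypothetical positive, symmetric reinforcement between two categories.
F = [[1.0, 0.5],
     [0.5, 1.0]]
B = transpose(F)  # symmetric reinforcement: B is the transpose of F

M = influence_matrix(B, F, A)
eigenvalue, weights = power_method(M)
# Sort (category, page) indices by decreasing weight to obtain a ranking.
ranking = sorted(range(len(weights)), key=lambda i: -weights[i])
```

With two categories and three pages, `M` is a 6 × 6 matrix, and each entry of `weights` is the weight of one (category, page) pair. Pure Python lists are used only to keep the sketch dependency-free; a numerical library would be the natural choice for real web graphs.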

However, with the inclusion of new categories, several parameters emerged in the model that must be calibrated and that influence the quality of the resulting page ranking. These parameters are summarized in the matrix F: in the special symmetric case, $B = F^{T}$ and $M_\sigma = M_\sigma^{T}$, as can be seen in equation (2).

$$M_\sigma = (F \otimes A_\sigma)^{T} + (F \otimes A_\sigma) \qquad (2)$$

Another interesting case of the XHITS model arises when the matrices B and F are positive: the matrix $M_\sigma$ is then also positive, and the authors called this positive reinforcement. In this case, the Perron-Frobenius Theorem asserts that the largest eigenvalue is positive and that there is a corresponding eigenvector with positive coordinates, which is enough to guarantee convergence of the iteration.

To find the matrix F that optimizes the rank quality, the authors proposed an Approximate Learning Process (ALP) that minimizes the training function (3), where C is the cost function, $Y_{\sigma j}$ is the reference rank and $H(F, A_\sigma)$ is the XHITS function. The cost function C is defined as a quadratic distance between the reference rank $Y_{\sigma j}$ and the generated rank $H(F, A_\sigma)$, which is simple and differentiable. However, differentiating $H(F, A_\sigma)$ is computationally expensive, so the authors replaced this derivative with an approximation.

$$E_{train} = \sum_{\sigma=1}^{q} \sum_{j=1}^{p} C\big(Y_{\sigma j}, H(F, A_\sigma)\big) \qquad (3)$$

Finally, the minimization is carried out using the gradient descent method. In the present work, we rebuild the training function (3), replacing the approximate term, and improve the quality of the results.

3 XHITS LEARNING WITH SVDLP

XHITS provides the vector $r_\sigma$ with all page ranks by computing and sorting the eigenvector associated with the largest eigenvalue of $M_\sigma$. Since $M_\sigma$ depends on F, changing the values of F changes the eigenvector and, with it, the corresponding ranks.

Hence, we develop a Machine Learning algorithm to find the value of F that maximizes ranking quality. The required matrix must be query independent, generalizing what is found in the training set.

3.1 Learning Goal

Assume that we are given a training set $T = \{(j, \sigma, o_{\sigma j}) \mid j = 1, \cdots, p \text{ and } \sigma = 1, \cdots, q\}$, where j is a page-id, σ is a query-id and $o_{\sigma j}$ is the correct rank of page j for query σ.

Let $r_{\sigma j}$ be the XHITS rank of page j for query σ. Our learning goal here is to find F that minimizes the training error E, given by

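$$E = \sum_{\sigma=1}^{q} \left[ \frac{1}{p} \sum_{j=1}^{p} \left( \frac{1}{1+e^{-o_{\sigma j}}} - \frac{1}{1+e^{-r_{\sigma j}}} \right)^{2} + \frac{\alpha}{(cp)^{2}} \sum_{i=1}^{cp} \sum_{j=1}^{cp} \left( M_{\sigma ij} - b_{\sigma i}\, r_{\sigma j} \right)^{2} \right] \qquad (4)$$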

Observe that the term

$$\frac{1}{p} \sum_{j=1}^{p} \left( \frac{1}{1+e^{-o_{\sigma j}}} - \frac{1}{1+e^{-r_{\sigma j}}} \right)^{2} \qquad (5)$$

corresponds to the average ranking error for query σ, where we rescale the rank values through a sigmoid function.

Additionally, the term

$$\frac{\alpha}{(cp)^{2}} \sum_{i=1}^{cp} \sum_{j=1}^{cp} \left( M_{\sigma ij} - b_{\sigma i}\, r_{\sigma j} \right)^{2} \qquad (6)$$

corresponds to the average error contribution of the largest-eigenvalue estimate, as in the Singular Value Decomposition (SVD) method (Brand, 2002), (Langville et al., 2005), (Gorrell, 2006).
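As an illustration of how the two components of E combine, the sketch below evaluates the training error for a list of queries in plain Python. It rests on several assumptions that are not confirmed by the text: both terms are taken as squared differences, the ranking term is averaged over the pages, the rank-one term is averaged over the entries of the matrix and weighted by a factor `alpha`, and, for simplicity, the vectors `o`, `r` and `b` all have the same length n as the n × n matrix `M`. The function names and the dictionary layout of a query are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def ranking_error(o, r):
    """Term (5): mean squared gap between sigmoid-rescaled ranks."""
    return sum((sigmoid(oj) - sigmoid(rj)) ** 2 for oj, rj in zip(o, r)) / len(o)


def rank_one_error(M, b, r):
    """Term (6) without alpha: mean squared residual of M against b r^T."""
    n = len(M)
    total = sum((M[i][j] - b[i] * r[j]) ** 2
                for i in range(n) for j in range(n))
    return total / (n * n)


def training_error(queries, alpha=1.0):
    """Equation (4): sum of both terms over all training queries.

    Each query is a dict with keys "M", "b", "r" and "o" (hypothetical layout).
    """
    return sum(ranking_error(q["o"], q["r"])
               + alpha * rank_one_error(q["M"], q["b"], q["r"])
               for q in queries)


# Toy query: M is exactly b r^T and the ranks match, so the error is zero.
exact = {"M": [[2.0, 4.0], [3.0, 6.0]], "b": [2.0, 3.0],
         "r": [1.0, 2.0], "o": [1.0, 2.0]}
# Same query with one entry of M perturbed by 1, so only term (6) is non-zero.
perturbed = dict(exact, M=[[3.0, 4.0], [3.0, 6.0]])
```

The second term vanishes exactly when `M` equals the outer product of `b` and `r`, which is why minimizing it drives `b` and `r` towards a rank-one factorization of the influence matrix.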


3.2 Gradient descent learning

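To minimize E, we apply the gradient descent method. Adapting the method to our purposes, we obtain the following update rules, where m is the iteration index and µ is the learning rate:

$$b^{m+1} \leftarrow b^{m} - \mu \frac{\partial E}{\partial b} \qquad (7)$$

$$r^{m+1} \leftarrow r^{m} - \mu \frac{\partial E}{\partial r} \qquad (8)$$

$$F^{m+1} \leftarrow F^{m} - \mu \frac{\partial E}{\partial F} \qquad (9)$$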

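Next, we show the partial derivatives of E with respect to r, b and F, that is

$$\frac{\partial E}{\partial r} = \sum_{\sigma=1}^{q} \left[ -\frac{2}{p} \sum_{j=1}^{p} \left( \frac{1}{1+e^{-o_{\sigma j}}} - \frac{1}{1+e^{-r_{\sigma j}}} \right) \left( \frac{e^{-r_{\sigma j}}}{(1+e^{-r_{\sigma j}})^{2}} \right) - \frac{2\alpha}{(cp)^{2}} \sum_{i=1}^{cp} \sum_{j=1}^{cp} \left( M_{\sigma ij} - b_{\sigma i}\, r_{\sigma j} \right) b_{\sigma i} \right] \qquad (10)$$

$$\frac{\partial E}{\partial b} = \sum_{\sigma=1}^{q} \left[ -\frac{2\alpha}{(cp)^{2}} \sum_{i=1}^{cp} \sum_{j=1}^{cp} \left( M_{\sigma ij} - b_{\sigma i}\, r_{\sigma j} \right) r_{\sigma j} \right] \qquad (11)$$

$$\frac{\partial E}{\partial F} = \sum_{\sigma=1}^{q} \left[ \frac{2\alpha}{(cp)^{2}} \sum_{i=1}^{cp} \sum_{j=1}^{cp} \left( M_{\sigma ij} - b_{\sigma i}\, r_{\sigma j} \right) \frac{\partial M_{\sigma ij}}{\partial F} \right] \qquad (12)$$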
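A common way to gain confidence in hand-derived gradients such as these is to compare them against central finite differences. The sketch below does this in plain Python for the derivatives with respect to b and r on a single toy query; the derivative with respect to F is left out because it depends on how M is assembled from F. As before, the squared terms, the averaging factors, the weight `alpha`, the equal length of all vectors and every name in the code are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def error(M, b, r, o, alpha):
    """Single-query error: ranking term plus alpha times the rank-one term."""
    n = len(r)
    rank_term = sum((sigmoid(o[j]) - sigmoid(r[j])) ** 2 for j in range(n)) / n
    svd_term = sum((M[i][j] - b[i] * r[j]) ** 2
                   for i in range(n) for j in range(n)) / (n * n)
    return rank_term + alpha * svd_term


def grad_b(M, b, r, alpha):
    """Analytic partial derivatives of the error with respect to each b[i]."""
    n = len(r)
    return [-2.0 * alpha / (n * n)
            * sum((M[i][j] - b[i] * r[j]) * r[j] for j in range(n))
            for i in range(n)]


def grad_r(M, b, r, o, alpha):
    """Analytic partial derivatives of the error with respect to each r[j]."""
    n = len(r)
    grads = []
    for j in range(n):
        s = sigmoid(r[j])
        # s * (1 - s) equals e^{-r} / (1 + e^{-r})^2, the sigmoid derivative.
        rank_part = -2.0 / n * (sigmoid(o[j]) - s) * s * (1.0 - s)
        svd_part = -2.0 * alpha / (n * n) * sum(
            (M[i][j] - b[i] * r[j]) * b[i] for i in range(n))
        grads.append(rank_part + svd_part)
    return grads


def numeric_grad(f, x, eps=1e-6):
    """Central finite differences of f at the list x."""
    grads = []
    for k in range(len(x)):
        up = x[:k] + [x[k] + eps] + x[k + 1:]
        down = x[:k] + [x[k] - eps] + x[k + 1:]
        grads.append((f(up) - f(down)) / (2.0 * eps))
    return grads


# Hypothetical toy values.
M = [[2.0, 1.0], [1.0, 3.0]]
b, r, o, alpha = [0.5, -0.3], [0.8, 1.2], [1.0, 2.0], 0.7

check_b = numeric_grad(lambda x: error(M, x, r, o, alpha), b)
check_r = numeric_grad(lambda x: error(M, b, x, o, alpha), r)
max_gap = max(abs(a - c) for a, c in
              zip(grad_b(M, b, r, alpha) + grad_r(M, b, r, o, alpha),
                  check_b + check_r))
```

Here `max_gap` is the largest absolute difference between an analytic and a numerical partial derivative; a value close to zero indicates that the signs and factors of the analytic expressions agree with the error function used in the sketch.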

Finally, in the next section, we describe the algorithm for the approach discussed here.

3.3 Algorithm

Now, we describe the approximate gradient descent learning algorithm as follows:

Begin
1 Initialize $b$, $r$ and $F$ with random values.
2 Calculate $E_{train}$ for every item of the training set; if it is small enough, stop, else continue.
3 Calculate $\partial E_{train} / \partial r$, $\partial E_{train} / \partial b$ and $\partial E_{train} / \partial F$.
4 Calculate $b^{m+1}$, $r^{m+1}$ and $F^{m+1}$.
5 Go back to step 2.
End
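The loop structure of the steps above can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: it handles a single query, keeps the influence matrix `M` fixed and therefore omits the update of F, and reuses the assumed form of the error (squared terms, averaging factors, weight `alpha`, vectors of equal length). The function names, the learning rate `mu`, the stopping `tolerance` and the toy data are all hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def error(M, b, r, o, alpha):
    n = len(r)
    rank_term = sum((sigmoid(o[j]) - sigmoid(r[j])) ** 2 for j in range(n)) / n
    svd_term = sum((M[i][j] - b[i] * r[j]) ** 2
                   for i in range(n) for j in range(n)) / (n * n)
    return rank_term + alpha * svd_term


def gradients(M, b, r, o, alpha):
    n = len(r)
    resid = [[M[i][j] - b[i] * r[j] for j in range(n)] for i in range(n)]
    scale = 2.0 * alpha / (n * n)
    g_b = [-scale * sum(resid[i][j] * r[j] for j in range(n)) for i in range(n)]
    g_r = []
    for j in range(n):
        s = sigmoid(r[j])
        rank_part = -2.0 / n * (sigmoid(o[j]) - s) * s * (1.0 - s)
        g_r.append(rank_part - scale * sum(resid[i][j] * b[i] for i in range(n)))
    return g_b, g_r


def train(M, o, alpha=1.0, mu=0.05, tolerance=1e-3, max_steps=2000, seed=0):
    rng = random.Random(seed)
    n = len(o)
    # Step 1: random initial values (F is not learned in this sketch).
    b = [rng.random() for _ in range(n)]
    r = [rng.random() for _ in range(n)]
    history = []
    for _ in range(max_steps):
        # Step 2: evaluate the training error and stop if it is small enough.
        history.append(error(M, b, r, o, alpha))
        if history[-1] < tolerance:
            break
        # Step 3: partial derivatives with respect to b and r.
        g_b, g_r = gradients(M, b, r, o, alpha)
        # Step 4: gradient descent updates, as in equations (7) and (8).
        b = [bi - mu * gi for bi, gi in zip(b, g_b)]
        r = [rj - mu * gj for rj, gj in zip(r, g_r)]
        # Step 5: the loop returns to step 2.
    return b, r, history


# Hypothetical toy data: a symmetric 2 x 2 influence matrix and reference ranks.
M = [[2.0, 1.0], [1.0, 3.0]]
o = [1.0, 2.0]
b, r, history = train(M, o)
```

The list `history` records the training error at each pass through step 2, which makes it easy to plot or inspect how the error evolves. A cap on the number of steps is added because the toy matrix has rank two, so the rank-one residual cannot reach zero and the tolerance alone would never stop the loop.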

4 EXPERIMENTAL RESULTS

In this section, we replicate the environment proposed

in (Filho et al., 2009) and compare the results with the

new approach.

4.1 Test Goal

Our major performance measure is ranking quality.

As a ﬁrst instance, we examine the XHITS model

with two more categories.

Our goal is to show that the SVD learning process (SVDLP) provides a remarkable improvement over the approximate learning process (ALP).

4.2 Test Environment

We adopt the same scheme proposed in (Filho et al., 2009) to build our benchmark. First, we fix a set of queries. There are 400 queries in the set with no overlaps, derived from Google's twenty most-searched topics for each day over a period of thirty days.

Cross-validation is a mainstay for measuring performance and progress in machine learning. Consequently, we kept this strategy and validated the ALP and SVDLP algorithms using 10-fold cross-validation.

As in any learning process, we have to split the benchmark into two subsets, using one for training and the other for testing. We generate these two subsets randomly to avoid any interference with, or bias in, the learning process.
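A random 10-fold split of the 400 queries can be sketched as below. The function name `k_fold_splits`, the fixed seed and the use of integer query ids are assumptions made for the example; the source does not describe how the folds were actually generated.

```python
import random


def k_fold_splits(items, k=10, seed=0):
    """Yield (train, test) pairs; every item is in exactly one test fold."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


query_ids = range(400)  # stand-ins for the 400 benchmark queries
splits = list(k_fold_splits(query_ids))
```

Each of the ten splits holds 360 training queries and 40 test queries, and shuffling before slicing is what keeps the assignment of queries to folds random.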

For the reference rank needed by the learning process, we made the same choice and use an artificial expert (as it is called in (Filho et al., 2009)): Google. This choice was made for benchmarking compatibility purposes. We are already working on a new benchmark to replace the artificial expert and to allow comparisons with other state-of-the-art algorithms.

Finally, we considered the first ten pages returned by the expert as the relevant ones, and we used the training set to fine-tune two matrices F, one using ALP and one using SVDLP. After the training step, the chosen matrices F are applied to the test set.

4.3 Test Results

Many different metrics can be used for recommendation systems and information retrieval. We believe that returning good pages among the first ten results is a good way to analyze our results, and the metric P@10 reflects this. This metric represents the precision of the first ten pages of results displayed. As stated above, we considered the first ten pages returned by the expert as the relevant pages.
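For reference, precision at ten can be computed as in the following short sketch, where the relevant set is the expert's first ten pages. The function name and the page identifiers are made up for the example.

```python
def precision_at_k(ranked_pages, relevant_pages, k=10):
    """Fraction of the first k ranked pages that belong to the relevant set."""
    relevant = set(relevant_pages)
    hits = sum(1 for page in ranked_pages[:k] if page in relevant)
    return hits / k


# Hypothetical page ids: the expert's top ten define the relevant set.
expert_top10 = [f"page{i}" for i in range(10)]
# The engine returns nine of them in its first ten positions.
engine_ranking = [f"page{i}" for i in range(9)] + ["other", "page9"]
score = precision_at_k(engine_ranking, expert_top10)
```

In this toy case nine of the engine's first ten pages are in the expert's list, so `score` is 0.9. The measure ignores the order of pages within the first ten positions.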

We summarize the test results in Table 1. We can observe that there is already a considerable gain of XHITS over HITS in terms of P@10, regardless of the learning process (LP). The gain of XHITS with the approximate learning process (ALP) is approximately 260%, and with the SVD learning process (SVDLP) it is approximately 400%, both with respect to HITS. Another important observation is the increased performance of XHITS with the new machine learning approach: we obtained a 40% improvement over ALP.

Table 1: Precision at 10 for HITS, XHITS (ALP) and XHITS (SVDLP).

Algorithm         Precision at Ten (P@10)
XHITS (ALP)       0.372455
XHITS (SVDLP)     0.519875
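The improvement of SVDLP over ALP can be checked directly from the two P@10 values reported in Table 1. The variable names below are illustrative.

```python
p10_alp = 0.372455    # XHITS (ALP), from Table 1
p10_svdlp = 0.519875  # XHITS (SVDLP), from Table 1

relative_gain = (p10_svdlp - p10_alp) / p10_alp
print(f"{relative_gain:.1%}")  # prints 39.6%
```

The relative gain of about 39.6% is consistent with the rounded figure of 40% quoted in the text.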

Looking inside the rankings, the best agreement between the ranks produced by XHITS SVDLP and Google was observed for the query oprah, and the worst for the query michele bachmann. The corresponding P@10 values were 0.9 and 0.1. During the period in which we selected the queries, Oprah, the star, was about to reveal something involving her family, and nine of the first ten pages matched Google's first ones. The result is shown in Table 2.

Table 2: The first ten links returned by the XHITS engine after the training.

Position  URL
1         http://www.oprah.com/
2         http://www.oprah.com/omagazine.html
3         http://www.imdb.com/name/nm0001856/
4         http://www.tmz.com/person/oprah-winfrey/
5         http://www.nydailynews.com/topics/Oprah+Winfrey
6         http://oprahsangelnetwork.org
7         http://www.livingoprah.com/
8         http://bossip.com/category/celeb-directory/oprah/
9         http://www.myspace.com/everything/oprah-winfrey
10        http://www.quotationspage.com/quotes/OprahWinfrey/

5 CONCLUSIONS AND FUTURE WORK

We explored the fact that the XHITS model provides a powerful approach and rebuilt the part of the model that was an open problem: how to find the set of parameters that best fits a given data set (Filho et al., 2009). To improve the model, a new learning process using SVD for the XHITS model has been presented. The preceding analysis and empirical results have shown that SVDLP performs well in the XHITS model: SVDLP learns a higher-precision XHITS model than ALP. This approach has its own benefits, as follows:

• the SVDLP approach has no approximate steps;

• the training function is fully differentiable.

For testing the new approach, we chose Google as our ranking expert in order to keep compatibility with the previous learning process, and compared the performance of HITS, XHITS ALP and XHITS SVDLP with each other. The gains of the XHITS SVDLP model over HITS are substantial, as shown in the experimental results: a gain of over 400% in proximity to Google's ranking. We are not claiming that this gain necessarily reflects the quality of the ranking, but it shows that we can learn a judged set of pages well.

For future work, we are changing the benchmark to the ClueWeb09 collection and comparing the performance with other ranking algorithms already explored and reported in the literature.

REFERENCES

Agichtein, E., Brill, E., and Dumais, S. (2006). Improving web search ranking by incorporating user behavior information. In SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 19–26, New York, NY, USA. ACM.

Agosti, M. and Pretto, L. (2005). A theoretical study of a generalized version of Kleinberg's HITS algorithm. Inf. Retr., 8(2):219–243.

Borodin, A., Roberts, G. O., Rosenthal, J. S., and Tsaparas, P. (2001). Finding authorities and hubs from link structures on the world wide web. In Tenth International World Wide Web Conference.

Brand, M. (2002). Incremental singular value decomposition of uncertain data with missing values. In Proceedings of the 7th European Conference on Computer Vision-Part I, ECCV '02, pages 707–720, London, UK. Springer-Verlag.

Chakrabarti, S., Joshi, M., and Tawde, V. (2001). Enhanced topic distillation using text, markup tags, and hyperlinks. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 208–216.

Cohn, D. and Chang, H. (2000). Learning to probabilistically identify authoritative documents. http://citeseer.ist.psu.edu/438414.html; http://www.andrew.cmu.edu/∼huan/phits.ps.gz.

Craswell, N. and Szummer, M. (2007). Random walks on the click graph. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '07, pages 239–246, New York, NY, USA. ACM.

Ding, C., He, X., Husbands, P., Zha, H., and Simon, H. D. (2002a). Pagerank, HITS and a unified framework for link analysis. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Poster session, pages 353–354.

Ding, C., Zha, H., Simon, H., and He, X. (2002b). Link analysis: Hubs and authorities on the world wide web. http://citeseer.ist.psu.edu/546869.html; http://www.nersc.gov/research/SCG/cding/papers ps/hits3.ps.

Filho, F. B. (2005). Xhits: Extending the hits algorithm for distillation of broad search topic on www. Master's thesis, Pontifícia Universidade Católica do Rio de Janeiro, Rio de Janeiro, Brazil.

Filho, F. B., Rentería, R. P., and Milidiú, R. L. (2009). Xhits - multiple roles in a hyperlinked structure. In Fred, A. L. N., editor, KDIR, pages 189–195. INSTICC Press.

Fowler, R. H. and Karadayi, T. (2002). Visualizing the web as hubs and authorities. http://citeseer.ist.psu.edu/551939.html; http://bahia.cs.panam.edu/TR/TR CS 02 27.pdf.

Giles, C. L., Flake, G. W., and Lawrence, S. (2000). Efficient identification of web communities. http://citeseer.ist.psu.edu/347042.html; http://www.neci.nec.com/∼lawrence/papers/web-kdd00/web-kdd00.ps.gz.

Gorrell, G. (2006). Generalized hebbian algorithm for incremental latent semantic analysis. In Proceedings of Interspeech.

Kleinberg, J. M. (1999). Hubs, authorities, and communities. ACM Computing Surveys (CSUR), 31(4es):5.

Langville, A. N. and Meyer, C. D. (2005). A survey of eigenvector methods of web information retrieval. SIAM Rev.

Lempel, R. and Moran, S. (2001). SALSA: the stochastic approach for link-structure analysis. ACM Transactions on Information Systems, 19(2):131–160.

Mendelzon, A. O. and Rafiei, D. (2000). What is this page known for? computing web page reputations. http://citeseer.ist.psu.edu/295882.html; ftp://ftp.db.toronto.edu/pub/papers/www9.ps.gz.

Mizzaro, S. and Robertson, S. (2007). Hits hits trec: exploring ir evaluation results with network analysis. In SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 479–486, New York, NY, USA. ACM.

yu Kao, H., ming Ho, J., syan Chen, M., and hua Lin, S. (2003). Entropy-based link analysis for mining web informative structures. http://citeseer.ist.psu.edu/572554.html; http://kp05.iis.sinica.edu.tw/shlin/paper/CIKM02.pdf.