D-RANK: A FRAMEWORK FOR SCORE AGGREGATION IN
SPECIALIZED SEARCH
Martin Veselý
LIA EPFL, CH-1015 Lausanne, Switzerland
Martin Rajman
LIA EPFL, CH-1015 Lausanne, Switzerland
Jean-Yves Le Meur
CERN, CH-1211 Geneva, Switzerland
Ludmila Marian
CERN, CH-1211 Geneva, Switzerland
Jérôme Caffaro
CERN, CH-1211 Geneva, Switzerland
Keywords:
Specialized search engines, Score aggregation, Information retrieval.
Abstract:
In this paper we present an approach to score aggregation for specialized search systems. In our work we focus on document ranking in scientific publication databases, working with the collection of scientific publications of the CERN Document Server. This paper reports on work in progress and describes a rank aggregation framework with score normalization. We present the results we obtained with aggregations based on logistic regression using both ranks and scores. In our experiments we concluded that score-based aggregation favored performance in terms of Average Precision and Mean Reciprocal Rank, while rank-based aggregation favored document discovery.
1 INTRODUCTION
Specialized search is gaining increasing attention across scientific communities. According to a recent study, users of scientific information in the field of particle physics often turn to specialized search services such as arXiv.org (http://arXiv.org/), SPIRES (http://www.slac.stanford.edu/spires/), or the CERN Document Server (CDS, http://cds.cern.ch/), rather than to general purpose search engines, when accessing scientific information (Gentil-Beccot et al., 2008).
In the scope of specialized search, the traditional
notion of relevance is often extended to incorporate additional attributes used to score and rank documents at the search engine output. When searching for scientific documents, ranking attributes are traditionally based on citations or on previous document usage such as "reads" or document access frequency. The intuition is that a document being cited or read by peers is evidence of its relevance within a given scientific field.
Additional attributes are sometimes used, such as the publication date. As new documents do not have a sufficient search or citation history, they might be incorrectly ranked when time is not taken into consideration.
A multitude of relevance attributes thus needs to
be aggregated within the document ranking process.
In this paper we propose an aggregation mechanism that allows for the aggregation of a multitude of query-independent attributes. We use two approaches, one aggregating the attribute scores and the other aggregating ranks, with a weighted sum and logistic regression as the aggregation vehicles. We present an evaluation framework that targets the CDS document collection, a production database used at CERN.
In Section 2 we outline the aggregation method, in Section 3 we present the experimental data setup, in Section 4 we present the results obtained on a test data set, and we conclude in Section 5.
2 SCORE AGGREGATION
We divide the process of score aggregation into three steps: (i) first, we select ranking attributes that are convenient for aggregation; (ii) in the second step, the scores are normalized and re-scaled; and (iii) finally, the scores are aggregated via a score aggregation function.
Selection of Attributes. In the first phase we select attributes that are convenient for aggregation. Attributes that are not correlated are good candidates for aggregation. On the other hand, attributes that are highly correlated can be considered as substitutes, and in that case we can select only one of them.
We noticed that traditional correlation coefficients such as the Spearman rank correlation or Kendall's tau do not take into account the importance of the top (numerically low) ranks. For this reason the correlations should be adjusted so as to put more weight on changes that occur in the upper part of the ranked list. Some work in this direction has also been suggested by (Yilmaz et al., 2008).
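For illustration, the following Python sketch (not part of the d-Rank system) compares two ranking attributes with the standard Kendall tau and with a simple top-weighted variant; the 1/rank pair weighting is an assumption made for this example, the text above only requires that more weight be put on the upper part of the list.

# Compare two ranked lists of the same documents with Kendall's tau and with
# a top-weighted agreement measure. The 1/rank pair weighting is illustrative.
from itertools import combinations
from scipy.stats import kendalltau

def top_weighted_agreement(rank_a, rank_b):
    # Fraction of concordant document pairs, each pair weighted by the inverse
    # of its best (numerically lowest) rank under attribute A.
    num = den = 0.0
    for d1, d2 in combinations(rank_a, 2):
        weight = 1.0 / min(rank_a[d1], rank_a[d2])
        concordant = (rank_a[d1] - rank_a[d2]) * (rank_b[d1] - rank_b[d2]) > 0
        num += weight * concordant
        den += weight
    return num / den

# rank_a / rank_b map document ids to their rank (1 = best) under two attributes
rank_a = {"doc1": 1, "doc2": 2, "doc3": 3, "doc4": 4}
rank_b = {"doc1": 2, "doc2": 1, "doc3": 4, "doc4": 3}
tau, _ = kendalltau([rank_a[d] for d in rank_a], [rank_b[d] for d in rank_a])
print("Kendall tau:", tau, "top-weighted agreement:", top_weighted_agreement(rank_a, rank_b))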
Score Normalization. In the second step we normalize the scores so that they reflect the underlying distribution of values. The idea is that a normalized score should reflect the proportion of the document population with lower scores, as observed for a given ranking attribute. For example, if a score of N corresponds to the median score among all of the observed scores, it should be converted into a normalized score of 0.5.
To determine the normalization function for each of the attributes, we first calculated values at the percentile level. We then smoothed the obtained values using standard density estimation techniques to approximate the underlying densities. Finally, we constructed the cumulative distribution function by summing the values over the corresponding intervals.
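A minimal sketch of this normalization step, assuming a Gaussian kernel (the text only mentions standard density estimation techniques without naming the kernel): fit a density estimate to the observed attribute values and use the resulting cumulative distribution as the normalized score, so that the median maps to approximately 0.5.

# Approximate the score density of one attribute with a Gaussian KDE and use
# the resulting cumulative distribution as the normalization function.
import numpy as np
from scipy.stats import gaussian_kde

def make_normalizer(raw_scores):
    kde = gaussian_kde(raw_scores)
    grid = np.linspace(min(raw_scores), max(raw_scores), 1000)
    cdf = np.cumsum(kde(grid))
    cdf /= cdf[-1]                         # rescale so the CDF ends at 1.0
    return lambda score: float(np.interp(score, grid, cdf))

# e.g. citation counts observed over the whole collection (synthetic here)
citations = np.random.lognormal(mean=2.0, sigma=1.0, size=5000)
normalize = make_normalizer(citations)
print(normalize(np.median(citations)))     # close to 0.5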
Score Aggregation. The task of score aggregation has been addressed in several previous works. Garcin et al. (Garcin et al., 2009) analyze the aggregation of feedback ratings into a single value. They consider different aggregations with respect to informativeness, robustness and strategyproofness. On all these attributes, they show that the mean seems to be the worst way of aggregating ratings, while the median is more robust. In previous works, logistic regression was also used as a vehicle to aggregate scores (Le Calvé and Savoy, 2000) (Savoy et al., 1996) (Craswell et al., 1999). In our preliminary study we adopted the two aforementioned aggregation models, based on logistic regression and on a weighted sum.
To rank documents with logistic regression, we first compute the logit value that corresponds to the particular combination of attribute scores of an individual document. We then project the obtained result onto the logistic curve and read the resulting aggregated score on the Y-axis.
More details about the implementation of rank aggregation with logistic regression in d-Rank can be found in (Vesely and Rajman, 2009). In this study we worked with manually chosen regression coefficients, for which we tested a variety of combinations. Eventually, the coefficients should be learned through an automated procedure. The way we generated the data for our experiments is described in more detail in the next section.
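The sketch below illustrates the two aggregation vehicles on the normalized attribute values of a single document; the weights and intercept are placeholders standing in for the manually chosen coefficients, not the values used in the experiments.

# The two aggregation vehicles: a weighted sum, and a logistic-regression style
# aggregate (logit = linear combination of attribute values, projected onto the
# logistic curve). Weights and intercept are placeholders.
import math

WEIGHTS = [0.4, 0.3, 0.2, 0.05, 0.05]      # one weight per ranking attribute
INTERCEPT = -1.0

def weighted_sum(values):
    return sum(w * v for w, v in zip(WEIGHTS, values))

def logistic_aggregate(values):
    logit = INTERCEPT + weighted_sum(values)
    return 1.0 / (1.0 + math.exp(-logit))  # read the score off the logistic curve

# "values" are either normalized scores (AW-S / LR-S) or normalized ranks
# (AW-R / LR-R) of one document under the individual ranking attributes.
doc_values = [0.9, 0.7, 0.5, 0.8, 0.2]
print(weighted_sum(doc_values), logistic_aggregate(doc_values))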
3 EXPERIMENTAL SETUP
Within our work we plan to perform two types of evaluation: a system evaluation using a referential that we extracted from the user access logs, and a user-centric evaluation (Voorhees, 2002).
In order to proceed with the system evaluation, we needed a referential that would allow us to compute and compare standard information retrieval measures for our system. To our knowledge, the CDS collection has not been used in an information retrieval evaluation in the past. One of the results of our work is thus a referential that allows for a system evaluation of the various document retrieval scenarios featured by the CDS retrieval system.
To create the referential of relevance judgments, we opted for parsing the user access logs for queries that were issued by users of the CDS search system in the past. This way we obtained a set of test queries for experimentation that is close to a real-world scenario. For this purpose, we created a tool that allows us to parse the user access logs and extract the
information that is essential for our experimentation, including search phrases, search attributes and the corresponding relevant documents. Queries in the referential are composed of all query terms used by a user, including all parameters used at search time. Document identifiers corresponding to known relevant documents that were downloaded after the search were then added. A typical referential entry looks as follows:
Query terms: Ellis, John
Field: Author
Collection: Published Articles
Action type: Search
Relevant document: 1282439
Relevant document: 1257907
...
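For illustration only, the following sketch shows the kind of grouping the parsing tool performs; the event structure and field names are hypothetical, as the actual CDS log format is not described here.

# Hypothetical sketch: group a search action with the documents downloaded
# afterwards in the same session to form a referential entry.
from collections import defaultdict

events = [
    {"session": "s1", "type": "search", "terms": "Ellis, John",
     "field": "Author", "collection": "Published Articles"},
    {"session": "s1", "type": "download", "recid": 1282439},
    {"session": "s1", "type": "download", "recid": 1257907},
]

referential = defaultdict(lambda: {"query": None, "relevant": []})
for event in events:
    if event["type"] == "search":
        referential[event["session"]]["query"] = event
    elif event["type"] == "download":
        referential[event["session"]]["relevant"].append(event["recid"])

for entry in referential.values():
    print(entry["query"]["terms"], entry["relevant"])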
For our initial experimentation we also generated a small data set, constructed in the following way: we generated a collection of one thousand documents and ranked them using five artificial, independent ranking attributes. Furthermore, we assumed that the individual ranking attributes perform relatively well when used separately for ranking. Our referential thus contains a set of documents that were ranked high: the referential documents were selected randomly with a log-normal distribution of ranks in order to favor good individual performance.
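A possible way to generate such a data set is sketched below; the parameters of the log-normal distribution are assumptions made for the example, the text only states that the referential documents were drawn with a log-normal distribution of ranks.

# Synthetic data set: 1000 documents, five independent ranking attributes, and
# a referential whose documents tend to sit near the top of each individual
# ranking (log-normally distributed target ranks).
import numpy as np

rng = np.random.default_rng(42)
N_DOCS, N_ATTRS, N_RELEVANT = 1000, 5, 20
relevant_docs = rng.choice(N_DOCS, size=N_RELEVANT, replace=False)

def ranking_for_attribute():
    # Random permutation in which each relevant document is moved to a
    # log-normally distributed (hence usually small) rank.
    order = list(rng.permutation(N_DOCS))
    for doc in relevant_docs:
        order.remove(doc)
        order.insert(min(int(rng.lognormal(mean=2.0, sigma=1.0)), N_DOCS - 1), doc)
    return order                             # order[0] is the best-ranked document

rankings = [ranking_for_attribute() for _ in range(N_ATTRS)]
print(sorted(rankings[0].index(d) + 1 for d in relevant_docs))  # ranks under attribute 1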
As far as the evaluation measures are concerned, we opted for Average Precision and Mean Reciprocal Rank, used previously in the TREC evaluations. These measures put more weight on better-ranked relevant documents for each query in the evaluation set: the closer to the top the relevant documents are observed on average, the better the computed performance of the ranker.
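For concreteness, a minimal sketch of the two measures for a single query, in one of their common formulations (the exact variants used are not spelled out above):

# Average Precision at k and Reciprocal Rank for one ranked list. `ranked` is a
# list of document ids (best first), `relevant` the set of relevant ids.
def average_precision_at_k(ranked, relevant, k):
    hits, precision_sum = 0, 0.0
    for i, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / i        # precision at this cut-off
    return precision_sum / min(len(relevant), k) if relevant else 0.0

def reciprocal_rank(ranked, relevant):
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

# Averaged over all queries of the referential, these give AP@k and MRR.
print(average_precision_at_k(["a", "b", "c", "d"], {"a", "c"}, k=4))
print(reciprocal_rank(["b", "a", "c"], {"a", "c"}))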
4 RESULTS
In this section we present the results obtained on the generated data set. We conducted the following two experiments. In the first experiment we focused on the performance of our aggregates and compared them to the best-performing individual ranking attribute. In the second experiment, we focused on how the ranking aggregation allows relevant documents to be lifted from the bottom of the ranked list into the visible area. We selected a threshold Rt, a rank that splits a ranked document list into two parts: the one that was visible to the user on the search output, and the "invisible" one. We then kept in the referential only the relevant documents that belong to the invisible part of the list (i.e. we removed documents that ranked well enough). We again computed the AP evaluation measure for the aggregates. This way we could estimate the quantity of relevant documents that did not score well enough using the individual rankings and were lifted into the visible ranking area after aggregation. In our experiment we selected Rt=100.
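A sketch of the filtering used in this second experiment follows; whether the invisible part is determined by the best individual attribute alone or by each attribute separately is left open above, so the sketch filters on a single baseline ranking.

# Keep in the referential only the relevant documents that the baseline ranking
# pushes below rank Rt; AP@k is then recomputed for the aggregated rankings on
# this reduced referential (e.g. with average_precision_at_k from the sketch above).
RT = 100

def invisible_relevant(baseline_ranked, relevant, rt=RT):
    visible = set(baseline_ranked[:rt])
    return {doc for doc in relevant if doc not in visible}

filtered = invisible_relevant(baseline_ranked=list(range(1000)),
                              relevant={5, 250, 800})
print(filtered)   # {250, 800}: the documents the baseline left "invisible"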
We aggregated the lists in two different ways: by score and by rank. We used the logistic regression aggregation framework and, for comparison, a simple weighted sum aggregation. We thus evaluated the following four ranking aggregates: (i) weighted sum of scores (AW-S), (ii) weighted sum of ranks (AW-R), (iii) score aggregate using logistic regression (LR-S), and (iv) rank aggregate using logistic regression (LR-R). We then calculated the MRR and AP@k measures for k in {5, 10, 20, 50, 100}. The obtained results are shown in Figure 1 and Table 1.
Figure 1: Average Precision at various levels (AP@5 to AP@100) for the best individual ranking attribute (baseline) and the ranking aggregates AW(R), AW(S), LR(R), LR(S).
Figure 2: Potential for discovery of relevant documents using AP@k, Rt=100, for the best individual ranking attribute and the four aggregates (linear combination and logistic regression, over ranks and scores).
Table 1: Results of the evaluation run on the test data (AP@5, AP@10, AP@20 and MRR).

           AP@5   AP@10  AP@20  MRR
Baseline   0.621  0.606  0.595  0.590
AW-S       0.693  0.638  0.598  0.632
LR-S       0.627  0.628  0.606  0.547
LR-R       0.351  0.294  0.293  0.447
AW-R       0.435  0.429  0.416  0.477
As shown in Table 2, the ranking aggregate based on logistic regression over
ranks provides the best performance in terms of average precision when considering only the relevant documents that were presumably not seen in the lists ranked with the individual attributes.
We now proceed with the significance measurements for the second experiment. We worked with a 10-fold data sample. Table 3 shows the test values for all pairs of aggregates. As shown, we found a non-significant difference between the two score-based aggregations, AW-S and LR-S, at the 95% significance level.
Table 2: 10-fold validation test for measuring lift with AP@5.

Fold    Base   AW-R   AW-S   LR-R   LR-S
1       0.077  0.231  0.080  0.382  0.053
2       0.064  0.197  0.067  0.394  0.060
3       0.062  0.210  0.067  0.381  0.054
4       0.064  0.181  0.069  0.396  0.089
5       0.062  0.248  0.070  0.412  0.069
6       0.054  0.224  0.065  0.363  0.063
7       0.044  0.243  0.061  0.345  0.069
8       0.049  0.218  0.062  0.415  0.056
9       0.056  0.199  0.069  0.310  0.083
10      0.044  0.236  0.054  0.414  0.076
Mean    0.058  0.219  0.066  0.381  0.067
StDev   0.010  0.022  0.007  0.034  0.012
Table 3: Significance test for measuring lift using AP@5.

           AW-S   AW-R   LR-S   LR-R
Baseline   2.25   21.1   1.88   30.0
AW-S       -      21.1   0.18   28.9
AW-R              -      19.1   12.8
LR-S                     -      27.6
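The exact significance test is not named above; assuming a paired t-test over the fold-level AP@5 values, the AW-S versus LR-S comparison could be run as follows (the fold values are the corresponding columns of Table 2):

# Hedged sketch of the pairwise significance check, assuming a paired t-test
# over the 10 fold-level AP@5 values (AW-S and LR-S columns of Table 2).
from scipy.stats import ttest_rel

aw_s = [0.080, 0.067, 0.067, 0.069, 0.070, 0.065, 0.061, 0.062, 0.069, 0.054]
lr_s = [0.053, 0.060, 0.054, 0.089, 0.069, 0.063, 0.069, 0.056, 0.083, 0.076]

stat, p_value = ttest_rel(aw_s, lr_s)
print(stat, p_value)   # small |stat| and large p: not significant at the 95% level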
5 CONCLUSIONS AND FUTURE
WORK
In this paper we proposed a framework for score aggregation in specialized search systems. In particular, we focused on the ranking of scientific documents in the particle physics community. We addressed the issues of score normalization and aggregation, using kernel density estimation for normalization and logistic regression as a possible vehicle for rank aggregation. We presented results from two experiments suggesting that score-based aggregation favored performance in terms of Mean Reciprocal Rank and Average Precision, while rank-based aggregation favored document discovery.
In future work we plan to proceed with a user-centric evaluation on a real-world information retrieval system. The goal is to confirm our preliminary results obtained on a small test data collection, and we plan to apply an automated procedure to learn the aggregated scoring function.
REFERENCES
Craswell, N., Hawking, D., and Thistlewaite, P. B. (1999).
Merging results from isolated search engines. In Aus-
tralasian Database Conference, pages 189–200.
Garcin, F., Faltings, B., and Jurca, R. (2009). Aggregating
reputation feedback. In Paolucci, M., editor, 1st Inter-
national Conference on Reputation (ICORE), pages
62–74, http://www.reputation09.net.
Gentil-Beccot, A., Mele, S., Holtkamp, A., O'Connell, H. B., and Brooks, T. C. (2008). Information resources in high-energy physics: Surveying the present landscape and charting the future course. J. Am. Soc. Inf. Sci. Technol., 60:150–160. arXiv:0804.2701.
Savoy, J., Le Calvé, A., and Vrajitoru, D. (1996). Report on the TREC-5 experiment: Data fusion and collection fusion.
Le Calvé, A. and Savoy, J. (2000). Database merging strategy based on logistic regression. Inf. Process. Manage., 36(3):341–359.
Vesely, M. and Rajman, M. (2009). Rank Aggregation
in Scientific Publication Databases Based on Logistic
Regression. Technical report.
Voorhees, E. (2002). The philosophy of information retrieval evaluation. In Proceedings of the Second Workshop of the Cross-Language Evaluation Forum on Evaluation of Cross-Language Information Retrieval Systems, pages 355–370. Springer-Verlag.
Yilmaz, E., Aslam, J. A., and Robertson, S. (2008). A new
rank correlation coefficient for information retrieval.
In SIGIR, pages 587–594.