Language Model and Clustering based Information Retrieval
Irene Giakoumi, Christos Makris and Yiannis Plegas
Department of Computer Engineering and Informatics, University of Patras, Patra, Greece
Keywords:
Clustering, Information Retrieval, Kullback-Leibler Divergence, Language Models, Query Expansion,
Redundancy Elimination, Semantics.
Abstract:
In this paper, we describe two novel frameworks for improving search results. Both of them organize relevant
documents into clusters utilizing a new soft clustering method and language models. The first framework is
query-independent and takes into account only the inter-document lexical or semantic similarities in order to
form clusters. We also attempt to locate duplicated content inside the formed clusters. The second framework is
query-dependent and uses a query expansion technique for cluster formation. The experimental evaluation
demonstrates that the proposed methods perform well in the majority of cases.
1 INTRODUCTION
Information retrieval aims at satisfying an informa-
tion need by ranking documents optimally. In re-
trieval systems, an information need is expressed in
the form of a query (Manning et al., 2008). The goal is to rank relevant documents higher than the non-relevant ones. Clearly, the performance of an information
retrieval system is determined by the chosen retrieval
function. A retrieval model formalizes the notion of
relevance and derives a retrieval function that can be
computed to rank documents.
Effective retrieval functions have been derived
from a class of probabilistic models, the language
modelling approaches. Essentially, the idea is to compute probability distributions over documents or collections of documents. Several approaches have proven the effectiveness of this method for information retrieval.
Recently, language models have been used in combination with another approach to information retrieval, cluster-based retrieval. Under this approach, documents relevant to a query are grouped into clusters, and their rank depends on the cluster they belong to. Cluster-based retrieval is based
on the cluster hypothesis according to which simi-
lar documents will satisfy the same query (Rijsber-
gen, 1979). Its aim is to identify the good clusters
for the given query. Recent research has shown that if these good clusters could be found, retrieval performance would improve over document-based retrieval, where the query is matched against individual documents (Raiber and Kurland, 2012).
An issue that affects the performance of information retrieval systems is content duplication; the task of addressing it is known as redundancy elimination. The same information often appears in more than one document, while the ideal answer to a query would present all the unique information in descending order of relevance. Solving this problem remains a challenge.
In this paper, we propose two new frameworks for improving query search results by forming and selecting the good clusters that cluster-based retrieval searches for. Both frameworks form clusters of relevant documents using a new soft clustering method and the language modelling approach. The first proposed method examines, query-independently, either lexical or semantic inter-document similarities in order to form clusters. To further improve search results, we examine the case of locating redundant information in the clusters of relevant documents our method forms; to do so, we implement and apply an existing redundancy elimination method over the clusters. The second framework creates a cluster for every possible expansion of the query, taking its semantics into account. Our final results are lists of documents with improved ranking.
The rest of this paper is organized as follows:
We briefly survey language models, cluster-based re-
trieval and the redundant elimination problem in Sec-
tion 2, while in Section 3 we describe the language
models estimation. We detail our frameworks for im-
proving search results in a query independent way in
Section 4 and in a query dependent way, using a query
expansion technique in Section 5. Section 6 contains
the experimental evaluation of our methods and we
conclude in Section 7.
2 RELATED WORK
A statistical language model assigns probabilities to a
sequence of words by means of a probability distribu-
tion. This concept has been used in various applica-
tions such as speech recognition and machine trans-
lation. The first to employ language models for in-
formation retrieval were Ponte and Croft (Ponte and
Croft, 1998). Their basic idea was to estimate a language model for each document and rank documents by the likelihood scoring method. Since then, several variants of and improvements to this basic method have been proposed.
Zhai and Lafferty (Zhai and Lafferty, 2004) have demonstrated the importance of the selection of smoothing parameters. The term smoothing refers to the adjustment of the maximum likelihood estimator of a language model so that it is more accurate and, at a minimum, does not assign zero probability to words that do not appear in a document; the latter is known as the zero probability problem.
Recent research has attempted to exploit corpus structure, that is, to use clusters of documents as a form of document smoothing within the language modelling retrieval framework. Language models over topics are constructed from clusters, and the documents are smoothed with these topic models to improve document retrieval ((Kurland and Lee, 2004); (Liu, 2006); (Liu and Croft, 2004)). In this line, Liu and Croft (Liu and Croft, 2004) cluster documents and smooth each document with the cluster containing it. Also, Kurland and Lee (Kurland and Lee, 2004) suggested a framework which, for each document, obtains the most similar documents in the collection and then smooths the document with the obtained "neighbour documents". These neighbourhoods of documents are formed using the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951), an asymmetric measure of how different two probability distributions are, as the similarity measure. This measure has been used in several studies ((Lafferty and Zhai, 2001); (Kurland, 2006)).
The use of clusters for smoothing purposes is an
approach to cluster-based retrieval, while the most
common approach aims to create and retrieve one or
more clusters in response to a query. Cluster-based
retrieval is an idea that has a long history and many re-
searchers have worked on it. Building on the cluster hypothesis (Rijsbergen, 1979), several hard clustering methods were employed ((Croft, 1980); (Jardine and van Rijsbergen, 1971); (Voorhees, 1985)), as well as several probabilistic (soft) clustering methods ((Blei et al., 2003); (Hofmann, 2001)). Soft clustering methods allow a document to discuss multiple topics; if one assumes that clusters represent topics, the document should then be associated with all corresponding clusters. This is the view adopted in this paper.
Finally, another important problem in information retrieval that we need to mention is information redundancy. Several approaches have been proposed to resolve this problem. Some of them use ranking algorithms in order to reduce the redundant information in the search results ((Agrawal et al., 2009); (Chen and Karger, 2006); (Radlinski et al., 2009)), while others reward novelty and diversity, covering all aspects of a topic ((Agrawal et al., 2009); (Clarke et al., 2008)). An interesting approach (Plegas and Stamou, 2013) extracts the novel information between documents with semantically equivalent content and creates a single text, called a SuperText, free of duplicated content. The suggestion we make in this paper is that clusters of documents, assuming that clusters represent topics, are an ideal group of documents in which to locate repeated information.
Since a query could have many interpretations, and the documents retrieved for a query could discuss different aspects of the same topic, the ranking of documents in the results list should take these parameters into account. Inspired by the related work cited above, we consider the idea of forming overlapping clusters of relevant documents, taking advantage of the asymmetry of the KL-divergence when it is applied over document language models. The clustering method we suggest not only locates pairs of similar documents, but also determines which of the two is more similar to the other. The first method we suggest locates inter-document lexical similarities using our soft clustering method applied over document language models. Subsequently, we implement the same idea, but this time we try to locate inter-document semantic similarities. To do so, we examine a new way to embed semantic information within document language models. This method is query-independent, since the query is used only for ranking purposes. As mentioned above, we consider the case of locating duplicated content inside the formed clusters, to further improve search results. For this purpose, we apply a redundancy elimination method (Plegas and Stamou, 2013) not over the initial search results list, but over each of our clusters, which is also a novel attempt to remove duplicated information. On the other hand, the second method we propose is query-dependent, as the query is important for the cluster formation. Specifically, we apply an existing query expansion method (Makris et al., 2012) to form all possible interpretations of a query, and using our clustering method we form clusters that answer to every expansion of the query. This method takes into account both lexical and semantic information of the query in order to calculate the document language models. The novelty of this method lies in the way the expansions of a query are utilized. In the following sections we describe the suggested frameworks in detail.
3 LANGUAGE MODEL
ESTIMATION
We represent the information contained in the docu-
ments of query results with the language modelling
approach. The basic idea of this method is the cal-
culation of a probability measure over strings that be-
long to a fixed vocabulary V (Manning et al., 2008).
That is, considering a fixed set of strings, a document
language model calculates the probability of locating
these strings in the document. In our case, vocabu-
lary V consists of words or senses. For simplicity, we
chose to use the Smoothed Maximum Likelihood Es-
timate (SMLE) model in our approach. According to
the Maximum Likelihood Estimate (MLE) model the
representation of a document d is based on the num-
ber of occurrences of each term t in a fixed vocabulary
V:
$$M_d^{MLE}(t) = \frac{f_{t,d}}{l_d}, \quad \forall t \in V \qquad (1)$$
where $f_{t,d}$ is the number of occurrences of term t in document d, and $l_d$ is the number of terms contained in both d and V. The MLE model suffers from the zero probability problem, that is:
$$M_d^{MLE}(t) = 0, \quad \forall t \in V \ \text{with} \ t \notin d \qquad (2)$$
In order to solve it, we smooth the model accord-
ing to the following formula:
$$M_d^{SMLE}(t) = \begin{cases} \dfrac{f_{t,d}}{l_d} - c, & \text{if } t \in d \\ eps, & \text{if } t \notin d \end{cases} \quad \forall t \in V \qquad (3)$$
where eps is a very small quantity, of the order of $10^{-10}$, and c is estimated as

$$c = eps \cdot \frac{|V| - s_d}{|V|} \qquad (4)$$

where $s_d$ is the number of terms that belong to both V and the document d, and $|V|$ is the size of the vocabulary used.
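To make the estimation concrete, the following is a minimal sketch of the SMLE computation (our own illustration; the paper provides no code). The function name, the eps constant and the guard for documents that share no terms with V are assumptions, as is the subtraction in Eq. (3) as reconstructed above:

```python
from collections import Counter

EPS = 1e-10  # the small quantity "eps" of Eq. (3); order of magnitude from the paper

def smle_model(doc_terms, vocabulary):
    """Smoothed Maximum Likelihood Estimate (Eqs. 1-4) over a fixed vocabulary V."""
    counts = Counter(t for t in doc_terms if t in vocabulary)
    l_d = sum(counts.values()) or 1   # occurrences of d's terms that belong to V
    s_d = len(counts)                 # number of distinct such terms
    # probability mass removed from seen terms so unseen ones can receive eps (Eq. 4)
    c = EPS * (len(vocabulary) - s_d) / len(vocabulary)
    return {t: (counts[t] / l_d - c) if t in counts else EPS
            for t in vocabulary}
```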
4 CLUSTERING WITH LEXICAL
OR SEMANTIC LANGUAGE
MODELS
In this section we describe our method for locating
inter-document similarities and forming clusters uti-
lizing the Kullback-Leibler divergence over lexically
or semantically enhanced language models. Here the
issued query is taken into account only for ranking
purposes, not for the creation of the clusters.
4.1 Document Representation
In our experiments, we use lexically or semantically
enhanced language models. The difference between
their estimation concerns the fixed vocabulary V. But
before describing the set V in each case, we need to
describe the initial processing of the search result list.
We select the top N documents retrieved for a
query, which will be called collection D. We break
each document d in collection D into sentences. Each
sentence is tokenized, POS-tagged and lemmatized, and all lemmas are removed except those corresponding to nouns, verbs and adjectives. The set of processed documents of collection D will be denoted as collection D'. We process the issued query in a similar way. Now that the documents of collection D' and the query contain only certain lemmatized terms, along with their POS-tags, we can describe the fixed vocabulary V for our language models.
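As an illustration of this preprocessing pipeline, here is a sketch using NLTK (an assumption; the paper does not name its tools). It keeps noun, verb and adjective lemmas together with a WordNet-style POS code:

```python
import nltk
from nltk.stem import WordNetLemmatizer

# Penn Treebank tag prefixes we keep, mapped to WordNet POS codes
KEPT = {'N': 'n', 'V': 'v', 'J': 'a'}   # nouns, verbs, adjectives
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    """Sentence-split, tokenize, POS-tag and lemmatize a document,
    keeping only noun, verb and adjective lemmas with their POS."""
    lemmas = []
    for sentence in nltk.sent_tokenize(text):
        for token, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
            if tag[0] in KEPT:
                lemmas.append((lemmatizer.lemmatize(token.lower(), KEPT[tag[0]]),
                               KEPT[tag[0]]))
    return lemmas
```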
The set V of the lexically enhanced language mod-
els consists of all the unique POS-tagged lemmas con-
tained in the collection D’. In other words, we exam-
ine the inter-document similarities considering only
the vocabulary that the documents contain. For reasons of dimensionality reduction, we also evaluate the case of reducing the size of V using the normalized TF-IDF scheme, as sketched below. In this case, the terms in V are ordered by their TF-IDF values and the desired number of terms is selected. The content of the documents in D' is also updated, so that they contain only terms of V.
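One possible reading of this reduction step is sketched below; the exact normalization and the use of the maximum per-document score are assumptions, since the paper only names "the normalized TF-IDF schema":

```python
import math
from collections import Counter

def reduce_vocabulary(docs, keep_fraction):
    """docs: list of per-document term lists; returns the top keep_fraction
    of terms ranked by their (maximum per-document) normalized TF-IDF."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))   # document frequencies
    scores = Counter()
    for d in docs:
        tf, length = Counter(d), max(len(d), 1)
        for t, f in tf.items():
            scores[t] = max(scores[t], (f / length) * math.log(n / df[t]))
    k = max(1, int(keep_fraction * len(scores)))
    return {t for t, _ in scores.most_common(k)}
```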
In the case of semantically enhanced language models, all the terms contained in collection D' are first looked up in WordNet, in order to locate all the possible senses each term can have. Next, we map each term of a document to a corresponding sense.
Words matching only one sense are annotated with it.
Words matching several senses are annotated with the
sense that exhibits the maximum average similarity
to the senses identified for the remaining document
words. We estimate semantic similarity based on the
Wu and Palmer metric (Wu and Palmer, 1994). The
LanguageModelandClusteringbasedInformationRetrieval
481
similarity between two senses s
i
and s
j
from two terms
w
i
and w
j
is given by:
Similarity(s
i
, s
j
) =
2 depth(LCS(s
i
, s
j
))
depth(s
i
) + depth(s
j
)
(5)
Since the appropriate senses for $w_i$ and $w_j$ are not known, our measure selects the senses which maximize Similarity in order to annotate every term in the document with an appropriate sense. Here, the similarity between documents is examined over the senses each one contains. The set of the unique senses annotated to the documents of D' will be used as the set V, considering also the option of reducing its size, in a way similar to the lexical language model case, using the TF-IDF scheme.
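The following sketch illustrates this annotation step with NLTK's WordNet interface. It simplifies the paper's procedure by anchoring ambiguous words on the senses of the unambiguous ones, rather than on all remaining document words; this simplification and the handling of cross-POS comparisons are our assumptions:

```python
from nltk.corpus import wordnet as wn

def annotate_senses(lemmas):
    """Annotate each (lemma, pos) pair with a WordNet sense: words with one
    sense keep it; ambiguous words get the sense with the highest average
    Wu-Palmer similarity (Eq. 5) to the senses of the unambiguous words."""
    candidates = {w: wn.synsets(w[0], pos=w[1]) for w in set(lemmas)}
    anchors = [ss[0] for ss in candidates.values() if len(ss) == 1]
    chosen = {}
    for w, senses in candidates.items():
        if len(senses) <= 1:
            chosen[w] = senses[0] if senses else None   # monosemous or unknown
            continue
        def avg_sim(s):
            # wup_similarity can return None for incomparable senses
            sims = [s.wup_similarity(a) or 0.0 for a in anchors]
            return sum(sims) / len(sims) if sims else 0.0
        chosen[w] = max(senses, key=avg_sim)
    return chosen
```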
After this procedure, which is necessary for the
language model estimation, we have the needed fixed
vocabulary V and the contents of documents in D’ in
the appropriate form. Using the formula described in
equation (3) we compute the lexically or semantically
enhanced language models for the N documents.
4.2 Soft Clustering Method
Now that documents are represented by probability
distributions, we focus on locating the similar ones
using KL divergence. Its value is given by:
$$KLD(f(x) \,\|\, g(x)) = \sum_x f(x) \log \frac{f(x)}{g(x)} \qquad (6)$$
where f(x) and g(x) are discrete probability distributions. The smaller the KL-divergence value, the more similar the distributions are; the smallest valid value is zero, indicating that the distributions are identical. The KL-divergence is non-symmetric, a characteristic that we take advantage of, in contrast to other research (Liu and Croft, 2004). Thus, formula (6) can be considered to measure the information loss if distribution g(x) replaces f(x), while $KLD(g(x) \,\|\, f(x))$ measures the information loss if distribution f(x) replaces g(x). In our case the probability distributions represent texts. That is, by calculating the KL-divergence values between documents we i) locate documents that have similar vocabularies or senses, and thus documents referring to the same topic, and ii) determine which of two documents is more similar to the other, that is, which contains less new information. Taking advantage of the asymmetry of the KL-divergence, we estimate:
$$KLD\big(M_{d_i}^{SMLE}(t) \,\|\, M_{d_j}^{SMLE}(t)\big) \ \text{and} \ KLD\big(M_{d_j}^{SMLE}(t) \,\|\, M_{d_i}^{SMLE}(t)\big), \quad \forall d_i, d_j \in D', \ i \neq j \qquad (7)$$
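A small sketch of this computation, reusing the smle_model function from the sketch in Section 3; since SMLE smoothing guarantees strictly positive probabilities, the divergence is well defined:

```python
import math

def kld(p, q):
    """KLD(p || q), Eq. (6), for two distributions over the same vocabulary."""
    return sum(pt * math.log(pt / q[t]) for t, pt in p.items() if pt > 0)

def pairwise_kld(models):
    """Both directed divergences for every ordered pair of documents (Eq. 7).
    models: dict mapping a document id to its SMLE distribution."""
    return {(i, j): kld(models[i], models[j])
            for i in models for j in models if i != j}
```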
Initially, we consider that each document of D' forms a singleton cluster. For every pair of documents, we compare both estimated KL-divergence values, and if at least one of them is small enough, then the two documents are lexically or semantically similar (depending on which set V is used) and should belong to the same cluster. We determine a parameter t as an upper bound on the KL-divergence values of similar documents. The decision of which document is considered similar to another is made according to the following formula:
$$\text{if } KLD(d_i \,\|\, d_j) \le t \ \text{OR} \ KLD(d_j \,\|\, d_i) \le t \text{ then } \begin{cases} d_i \in Cl(d_j), & \text{iff } KLD(d_i \,\|\, d_j) < KLD(d_j \,\|\, d_i) \\ d_j \in Cl(d_i), & \text{iff } KLD(d_j \,\|\, d_i) < KLD(d_i \,\|\, d_j) \end{cases} \quad \forall i, j \text{ with } i \neq j \text{ and } 1 \le i, j \le |D| \qquad (8)$$
where $Cl(d_i)$ is the cluster that initially contained document $d_i$. For every cluster, we consider the document it initially contained as its basis. During the clustering process, some of the resulting clusters may turn out to be subsets of a bigger cluster; in this case, the internal clusters are eliminated and only the external one is kept, without removing any of the documents it contains.
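The clustering rule and the elimination of nested clusters can be sketched as follows (document identifiers are assumed orderable; the divergences argument is the output of pairwise_kld above):

```python
def soft_cluster(divergences, doc_ids, t=10.0):
    """Apply the rule of Eq. (8): a document joins the cluster of a similar
    document when its divergence towards it is the smaller of the two."""
    clusters = {d: {d} for d in doc_ids}        # one singleton cluster per basis
    for i in doc_ids:
        for j in doc_ids:
            if i >= j:
                continue
            d_ij, d_ji = divergences[(i, j)], divergences[(j, i)]
            if min(d_ij, d_ji) <= t:
                if d_ij < d_ji:
                    clusters[j].add(i)          # d_i joins Cl(d_j)
                else:
                    clusters[i].add(j)          # d_j joins Cl(d_i)
    # eliminate clusters nested inside a bigger one, keeping their documents
    return [c for c in clusters.values()
            if not any(c < other for other in clusters.values())]
```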
4.3 Removing Redundant Information
At this point, we have organized the documents into
overlapping clusters of similar documents. We claim
that only documents in the same cluster have dupli-
cated information. In order to locate and remove it, we merge the documents of each cluster into a single text, keeping only the unique information, an idea that originates from (Plegas and Stamou, 2013). If, after clustering, a cluster contains only its basis document, then that document remains as it is. If a cluster contains more than one document, then the new unique information contained in the other documents of the cluster (the member documents) is added to the content of the basis document.
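As a rough illustration only: the sketch below approximates the merge with a lexical sentence-overlap test, whereas the actual SuperText construction (Plegas and Stamou, 2013) identifies equivalent content at a finer semantic level. The Jaccard overlap threshold is an arbitrary assumption:

```python
def merge_cluster(basis_sentences, member_docs, overlap=0.8):
    """Merge a cluster's member documents into its basis document, skipping
    sentences whose word overlap with already-kept text is too high."""
    merged = list(basis_sentences)
    seen = [set(s.split()) for s in basis_sentences]
    for doc in member_docs:                     # each doc is a list of sentences
        for sent in doc:
            words = set(sent.split())
            is_dup = any(len(words & kept) / max(len(words | kept), 1) >= overlap
                         for kept in seen)
            if not is_dup:                      # keep only genuinely new information
                merged.append(sent)
                seen.append(words)
    return " ".join(merged)
```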
4.4 Ranking
The process described in the previous section results in a new collection of documents, which contains texts generated from more than one document of the same cluster, and possibly some of the documents initially retrieved for the query (clusters that remained singletons). The last step of our method is to present this collection as a ranked list in decreasing order of interest. We estimate the lexically enhanced language model for all the documents in the final collection and for the query, using as fixed vocabulary the POS-tagged lemmas contained in the query.
At this point we choose to use only terms, not senses, for the estimation of the language models, because, firstly, the usually small number of terms contained in a query does not support the application of a word sense disambiguation method and, secondly, we do not want to focus on a certain interpretation of the query. Having estimated the language models, we measure how similar the language model of each document is to the language model of the query via the corresponding KL-divergence values. Sorting the KL-divergence values in increasing order yields the desired ranked list of documents.
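A sketch of this ranking step, reusing smle_model and kld from the earlier sketches; the direction of the divergence, KLD(query || document), is our assumption, chosen for consistency with Eq. (9) below, as the paper does not state it explicitly:

```python
def rank_documents(doc_terms_by_id, query_terms):
    """Rank the final collection by increasing KLD between the query's model
    and each document's model, both over the query lemmas as vocabulary."""
    vocab = set(query_terms)
    q_model = smle_model(query_terms, vocab)
    scores = {d: kld(q_model, smle_model(terms, vocab))
              for d, terms in doc_terms_by_id.items()}
    return sorted(scores, key=scores.get)       # smaller divergence ranks higher
```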
5 CLUSTERING USING QUERY
EXPANSION
This section describes our method for creating the clusters, taking into account the different interpretations a query could have. We attempt to form clusters of documents, each of which answers a specific expansion of the query, that is, a specific interpretation of it.
5.1 Document Representation
In order to represent the contents of the documents in the results list we use, this time, only lexical language models. We made this choice because the language models are calculated relying on the query expansions, as will be explained in this and the following section. Again, we use the top N documents of the results list; this is collection D. As in the previous method, we break the documents into sentences and each sentence into tokens, which are subsequently POS-tagged and lemmatized, keeping only the lemmas of nouns, verbs and adjectives. As before, we call the processed collection D'.
5.2 Expanding the Query
Query expansion is the process of reformulating
the issued query to improve retrieval performance
(Baeza-Yates and Ribeiro-Neto, 2011). In our method, we expand the queries employing an idea based on an already existing method: replace the initial query with a set of queries (collection Q), each one containing a different combination of senses of the query terms (Makris et al., 2012).
The query expansion procedure operates as follows: Firstly, we tokenize and POS-tag the query terms, and we keep only the lemmas of nouns, verbs and adjectives, as we did for collection D. The next step is to look up each of these terms in WordNet. This results in a set of senses for every query term. For every sense, WordNet provides a definition, called a gloss. We process these glosses in order to extract from them the lemmas of nouns, verbs and adjectives, with the same procedure described previously. The final step in creating the collection Q of expanded queries is to consider all possible combinations of senses for the terms and to add the terms extracted from the corresponding glosses to the initial query. Before clustering, we need to compute language models over the documents in collection D', and for that we need a fixed vocabulary V. Since we want to form a cluster for each expanded query, we choose to use the terms contained in each expanded query as the fixed vocabulary. The language models of the documents are estimated as described in Section 3.
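The expansion procedure can be sketched as follows, reusing the preprocess function from the Section 4.1 sketch; representing a missing WordNet entry by None is our own convention:

```python
from itertools import product
from nltk.corpus import wordnet as wn

def expand_query(query_lemmas):
    """Build collection Q: one expanded query per combination of WordNet
    senses of the query terms, enriched with the lemmas of each sense's gloss.
    query_lemmas: (lemma, pos) pairs as produced by the preprocess sketch."""
    sense_lists = [wn.synsets(w, pos=p) or [None] for w, p in query_lemmas]
    expansions = []
    for combo in product(*sense_lists):         # one sense combination per query
        terms = [w for w, _ in query_lemmas]
        for sense in combo:
            if sense is not None:
                # gloss lemmas are extracted with the same preprocessing
                terms += [l for l, _ in preprocess(sense.definition())]
        expansions.append(terms)
    return expansions
```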
5.3 Clustering
After we have computed the language model of each
expanded query and of the documents as described in
the previous subsection, we use a clustering method
similar to the one described in subsection 4.2.
$$\text{if } KLD(q_i \,\|\, d_j) \le t \text{ then } d_j \in Cl(q_i), \quad \forall q_i \in Q, \ \forall d_j \in D' \qquad (9)$$
Here, the Kullback-Leibler divergence values are calculated between the language model of every expanded query and that of every document, in order to evaluate which document distributions are similar to the expanded query's distribution. The result of this process is a cluster for every expansion of the query; these clusters may overlap. Since the length of the fixed vocabulary V is generally quite short, we evaluate our clustering method for different values of the parameter t.
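A sketch of this assignment, again reusing smle_model and kld; the default threshold reflects the best-performing value reported in Section 6.2:

```python
def cluster_by_expansion(expanded_queries, doc_terms_by_id, t=16.0):
    """Apply Eq. (9): a document joins an expanded query's cluster when
    KLD(query || document) over the query's vocabulary is at most t."""
    clusters = {}
    for qi, q_terms in enumerate(expanded_queries):
        vocab = set(q_terms)
        q_model = smle_model(q_terms, vocab)
        members = {}
        for d, terms in doc_terms_by_id.items():
            div = kld(q_model, smle_model(terms, vocab))
            if div <= t:
                members[d] = div                # kept for within-cluster ranking
        clusters[qi] = members
    return clusters
```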
5.4 Ranking
At this stage we have as many clusters as expanded queries. The last step is to rank our results. First, we rank the documents in each cluster according to the Kullback-Leibler divergence values calculated previously; the documents are sorted in ascending order. Finally, we rank the clusters. In order to reduce the computational cost, we rank the clusters, and consequently the expanded queries, according to the number of documents each cluster contains. The intuition behind this choice is that we place at the top positions the expanded query for which most of the documents in collection D have information.
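The two-level ranking can be sketched as follows; flattening the overlapping clusters into a deduplicated list is our assumption about how the final results list is produced:

```python
def final_ranking(clusters):
    """Order clusters by size (largest first) and, inside each cluster,
    documents by ascending divergence; flatten into one results list."""
    ranked, seen = [], set()
    for members in sorted(clusters.values(), key=len, reverse=True):
        for d in sorted(members, key=members.get):
            if d not in seen:                   # clusters may overlap
                ranked.append(d)
                seen.add(d)
    return ranked
```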
LanguageModelandClusteringbasedInformationRetrieval
483
6 EXPERIMENTAL EVALUATION
To carry out our evaluation, we used the same 70 web queries for both of our methods, drawn from the TREC Web Tracks (2011, 2012). We selected the queries with the best characteristics for performing our experiments. All tracks use the 1-billion-page ClueWeb09 dataset (http://lemurproject.org/clueweb09/) and include a diversity task, which provides a ranked list of documents that covers the query topic while avoiding information redundancy. For every result there is a relevance judgement by NIST assessors indicating its relevance to the query topic. We employed the Indri search engine (http://www.lemurproject.org/indri/), and for each query we retained the top 50 results for both evaluated methods. In all of the figures that follow, the vertical axis represents a-nDCG values, while the horizontal axis represents the rank position of documents in the final results list.
6.1 Evaluating the First Method
We created clusters of similar documents; the upper bound on the KL-divergence value, parameter t, was experimentally tuned and took the value t=10.0. We evaluated our results keeping 20%, 40%, 60%, 80% and 100% of the terms/senses in the fixed vocabulary V. We compared our techniques, semantic (semantically enhanced) and lexical (lexically enhanced), with an approach that used k-means as the clustering algorithm (we did this to test the effect of the clustering method).
Finally, we assessed the performance of the proposed methods by comparing their rankings with the initial rankings of the Indri search engine. We measured the efficiency of the proposed methods using (i) the relevance judgements and (ii) the a-nDCG (a-normalized Discounted Cumulative Gain) measure (Clarke et al., 2008) with a=0, where a value of 1.0 is a good indicator.
Figure 1 contains the results of our method when we kept 60% of the terms/senses, as this is the best case. The results of the other four examined cases are very close to the one presented.
According to the results, our method exceeds the baseline. However, k-means performs better than our clustering method here, which is an indicator that a less "strict" value of t is needed. The results also show that semantically enhanced language models do not give as good results as lexical language models at the top rank positions. We claim that this indicates that semantically enhanced language models are more sensitive to the way they are calculated.
Figure 1: Average a-nDCG value for every rank position, keeping 60% of the terms/senses.
6.2 Evaluating the Second Method
With this method we created clusters for every expansion of the queries. Because of the short length of the fixed vocabulary used, the value t=10.0 could be considered quite "strict". For this reason we evaluated our method for the following values of parameter t: 10.0, 12.0, 14.0, 16.0, 18.0 and 20.0.
Figure 2 contains the results of our method. This method also exceeds the baseline and shows that the value 16.0 gives the best results, which confirms our argument about the need for a less strict, yet not overly tolerant, value for parameter t. We should also note that this method is suitable for queries whose terms are contained in WordNet.
Figure 2: Average a-nDCG value for every rank position.
Finally, we compared the best results of the two proposed methods. Figure 3 shows that clustering in a query-dependent way gives a better ranking than clustering independently of the query. This can be considered justified, since cluster formation based on the query's interpretations is more adaptable and less sensitive to the choices of the documents' authors. Also, our clustering algorithm gives better results than k-means when applied to the second proposed method.
Figure 3: Average a-nDCG value for every rank position
when comparing the two proposed methods.
The previous results show that in all cases our proposed methods perform better than the baseline of the Indri search engine. This is encouraging and demonstrates the potential of our methods. To improve our results, our next step is to improve the performance of the semantically enhanced language models, as we believe they are a valuable tool and could even exceed the performance of lexical language models. We also intend to conduct further research on the performance of our clustering method and compare it with more clustering algorithms. We believe that further insights could be gained by comparing our results with those obtained using different language models, smoothing methods and redundancy elimination methods.
7 CONCLUSIONS
In this paper we presented two novel frameworks that improve the results list of a query. The first method creates overlapping clusters using our proposed clustering method and lexically or semantically enhanced language models; additionally, it removes redundant content from the clusters. The second method we propose creates a set of overlapping clusters, where each cluster answers one of the interpretations of the initial query. The novelty of these frameworks lies in the suggested clustering method, which utilizes the asymmetry of the KL-divergence; the way language models are semantically enhanced; the use of a query expansion technique to form clusters, each of which answers an interpretation of the initial query; and the application of a redundancy elimination method over clusters. The experimental results indicated that the proposed methods perform quite well in relation to state-of-the-art techniques.
ACKNOWLEDGEMENTS
This research has been co-financed by the European Union (European Social Fund - ESF) and Greek national funds through the Operational Program "Education and Lifelong Learning" of the National Strategic Reference Framework (NSRF) - Research Funding Program: Thales ("Investing in knowledge society through the European Social Fund").
REFERENCES
Agrawal, R., Gollapudi, S., Halverson, A., and Ieong, S.
(2009). Diversifying search results. In Proceedings
of the Second ACM International Conference on Web
Search and Data Mining. ACM.
Baeza-Yates, R. and Ribeiro-Neto, B. (2011). Modern In-
formation Retrieval: the concepts and technology be-
hind search. ACM Press, 2nd edition.
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent
dirichlet allocation. Journal of Machine Learning Re-
search, 3:993–1022.
Chen, H. and Karger, D. R. (2006). Less is more: Prob-
abilistic models for retrieving fewer relevant docu-
ments. In Proceedings of the 29th Annual Interna-
tional ACM SIGIR Conference on Research and De-
velopment in Information Retrieval, SIGIR ’06, pages
429–436. ACM.
Clarke, C., Kolla, M., Cormack, G., Vechtomova, O.,
Ashkan, A., Büttcher, S., and MacKinnon, I. (2008).
Novelty and diversity in information retrieval evalua-
tion. In Proceedings of the 31st Annual International
ACM SIGIR Conference on Research and Develop-
ment in Information Retrieval, SIGIR ’08, pages 659–
666. ACM.
Croft, W. (1980). A model of cluster searching based on
classification. Information Systems, 5(3):189 – 195.
Hofmann, T. (2001). Unsupervised learning by probabilis-
tic latent semantic analysis. Machine Learning, 42(1–
2):177–196.
Jardine, N. and van Rijsbergen, C. (1971). The use of hier-
archic clustering in information retrieval. Information
Storage and Retrieval, 7(5):217 – 240.
Kullback, S. and Leibler, R. A. (1951). On information and
sufficiency. Ann. Math. Statist., 22(1):79–86.
LanguageModelandClusteringbasedInformationRetrieval
485
Kurland, O. (2006). Inter-document Similarities, Language
Models, and Ad Hoc Information Retrieval. PhD the-
sis, Cornell University.
Kurland, O. and Lee, L. (2004). Corpus structure, language
models, and ad hoc information retrieval. In Pro-
ceedings of the 27th Annual International ACM SIGIR
Conference on Research and Development in Informa-
tion Retrieval, SIGIR ’04, pages 194–201. ACM.
Lafferty, J. and Zhai, C. (2001). Document language mod-
els, query models, and risk minimization for informa-
tion retrieval. In Proceedings of the 24th Annual Inter-
national ACM SIGIR Conference on Research and De-
velopment in Information Retrieval, SIGIR ’01, pages
111–119. ACM.
Liu, X. (2006). Cluster-based retrieval from a language-
modeling perspective. In Proceedings of the 29th
Annual International ACM SIGIR Conference on Re-
search and Development on Information Retrieval,
pages 737–738.
Liu, X. and Croft, W. B. (2004). Cluster-based retrieval
using language models. In Proceedings of the 27th
Annual International ACM SIGIR Conference on Re-
search and Development in Information Retrieval, SI-
GIR ’04, pages 186–193. ACM.
Makris, C., Plegas, Y., and Stamou, S. (2012). Web query
disambiguation using pagerank. JASIST, 63:1581–
1592.
Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
Plegas, Y. and Stamou, S. (2013). Reducing information
redundancy in search results. In Proceedings of the
28th Annual ACM Symposium on Applied Computing,
SAC ’13, pages 886–893. ACM.
Ponte, J. M. and Croft, W. B. (1998). A language model-
ing approach to information retrieval. In Proceedings
of the 21st Annual International ACM SIGIR Con-
ference on Research and Development in Information
Retrieval, SIGIR ’98, pages 275–281. ACM.
Radlinski, F., Bennett, P. N., Carterette, B., and Joachims,
T. (2009). Redundancy, diversity and interdependent
document relevance. SIGIR Forum, 43(2).
Raiber, F. and Kurland, O. (2012). Exploring the cluster
hypothesis, and cluster-based retrieval, over the web.
In Proceedings of the 21st ACM International Con-
ference on Information and Knowledge Management,
CIKM ’12, pages 2507–2510. ACM.
Rijsbergen, C. J. V. (1979). Information Retrieval.
Butterworth-Heinemann, 2nd edition.
Voorhees, E. M. (1985). The cluster hypothesis revisited.
In Proceedings of the 8th Annual International ACM
SIGIR Conference on Research and Development in
Information Retrieval, SIGIR ’85, pages 188–196.
ACM.
Wu, Z. and Palmer, M. (1994). Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, ACL '94, pages 133–138. Association for Computational Linguistics.
Zhai, C. and Lafferty, J. (2004). A study of smoothing
methods for language models applied to information
retrieval. ACM Trans. Inf. Syst., 22(2):179–214.
WEBIST2015-11thInternationalConferenceonWebInformationSystemsandTechnologies
486