PSemQE: Disambiguating Short Queries Through Personalised
Semantic Query Expansion
Oliver Baumann (https://orcid.org/0000-0003-4919-9033) and Mirco Schoenfeld (https://orcid.org/0000-0002-2843-3137)
University of Bayreuth, Universitätsstraße 30, 95447 Bayreuth, Germany
Keywords: Information Retrieval, Query Expansion, User-Centric Design, Word Embeddings, Word Sense Disambiguation.
Abstract:
Locating items in large information systems can be challenging, especially if the query has multiple senses
referring to different items: depending on the context, Amazon may refer to the river, rainforest, or a myth-
ical female warrior. We propose and study Personalised Semantic Query Expansion (PSemQE) as a means
of disambiguating short, ambiguous queries in information retrieval systems. This study examines PSemQE’s
effectiveness in retrieving relevant documents matching intended senses of ambiguous terms and ranking them
higher versus a base query without expansion. Synthetic user profiles focused on narrow domains were gener-
ated to model well-defined information needs. Word embeddings trained on these profiles were used to expand
queries with semantically related terms. Experiments were conducted on corpora of varying sizes to measure
the retrieval of predetermined target articles. Our results show that PSemQE successfully disambiguated pol-
ysemous queries and ranked the target articles higher than the base query. Furthermore, PSemQE produces
result sets with higher relevance to user interests. Despite limitations like synthetic profiles and cold-start
issues, this study shows PSemQE’s potential as an effective query disambiguation engine. Overall, PSemQE
can enhance search relevance and user experience by leveraging user information to provide meaningful re-
sponses to short, ambiguous queries.
1 INTRODUCTION
Large knowledge repositories have become a staple of
digital life; however, navigating their content can be
daunting for users. Galleries, libraries, archives, and
museums consistently strive to make their collections
accessible through digital catalogues; there are hardly
any commodities that cannot be located in the vast
catalogue of e-commerce platforms, media-streaming
platforms are abundant in content, and the de facto
go-to online encyclopedia, English Wikipedia, lists
more than seven million content pages as of July
2025. Various methods have been explored in the
field of information retrieval to improve the naviga-
bility of large collections and ultimately present users
with relevant results. Query reformulation consti-
tutes a general framework for improving the “coordi-
nation match” (Krovetz and Croft, 1992) to increase
retrieval performance by adding, reweighting, or re-
moving search query terms. The addition of synony-
mous or semantically similar terms to the initial query
is known as query expansion, which allows the search
engine to potentially find more relevant documents
that match these terms.
However, queries may suffer from ambiguity
when the surface forms of the terms have differ-
ent meanings. This is especially apparent for short
queries, where a search for “python” may return doc-
uments related to the programming language, family
of snakes, or any other meaning referred to by this
surface form. While query expansion may still help
shed light on the fact that all these different senses
exist, it will likely take several iterations for the user
to arrive at the results they expected, so much so that
most users will pre-empt this process by performing
manual query expansion at the first iteration, provid-
ing a query such as “python programming language”
to narrow down the set of relevant documents. How-
ever, this requires knowledge that the query is am-
biguous in the first place.
Personalised query expansion (PQE) can largely
avoid this manual query modification by selecting ex-
pansion terms based on user context. Broadly, this
context is any side-channel information that charac-
terises a user’s potential information needs, such as
query histories, page views, or social media content.
This approach inherently requires a language model
for both the “user language” and the corpus.
The dominant approach to language modelling
is embedding, by which words are represented as
dense high-dimensional vectors. Recently, the con-
textualised word embeddings underlying large lan-
guage models (LLMs) have advanced information re-
trieval; however, their size and computational cost
during training and inference can become an obsta-
cle for constrained settings. In contrast, static embed-
ding models, such as word2vec and GloVe, remain
efficient to train and query, even on consumer-grade
hardware. Furthermore, they facilitate the develop-
ment of explainable intelligent systems, as querying
the embedding for similar terms directly relates to the
training set, specifically a user profile, rather than ex-
tensive Web-scale collections, such as the Common
Crawl.
In this study, we propose and analyse a refinement
of PQE, which we refer to as Personalised Seman-
tic Query Expansion (PSemQE). PSemQE expands
queries with semantically close terms with respect to
the user’s context by performing a nearest-neighbour
search in the embedding space. Crucially, it uses lo-
cal embeddings of user profiles only during query ex-
pansion and defers actual retrieval to a downstream
search engine.
Although similar systems have been explored pre-
viously, their ability to disambiguate queries for the
user’s benefit is not well understood. This study pro-
vides insights into the extent to which PSemQE can
perform word-sense disambiguation in a user-centric
manner.
Our contributions are summarised as follows:
- Detailed analysis of the rank dynamics governing personalised query expansion for short, ambiguous queries;
- Outlier analysis of cases where expansion produces inferior results;
- Collection of a labelled dataset for evaluating query disambiguation, as well as synthetic user modelling.
The remainder of this paper is structured as fol-
lows: Section 2 discusses related work and positions
this study in a broader context. In Section 3, we pro-
vide an operationalisation and describe our approach
to data collection and analysis, as well as the experi-
mental design. The results are reported and discussed
in Section 4. Finally, Section 5 concludes the paper
and provides an outlook on future research.
2 RELATED WORK
In the following section, we review the literature re-
lated to our study. We begin by reviewing query
expansion in general before moving to the subfield
of personalised query expansion. As language mod-
elling is an important part of any query modification
system, we conclude this section by reviewing the
key contributions in this field, including recent ap-
proaches using large language models.
Query Expansion and Ambiguity
Query expansion is a type of query reformulation
that extends the initial query with additional terms
to improve retrieval performance. Suitable candi-
dates for expansion terms can be sourced from the
search results themselves or from “knowledge struc-
tures” (Efthimiadis, 1996).
Relying on search results as a source for ex-
pansion terms constitutes a form of relevance feed-
back (Manning et al., 2008), where the results re-
trieved from an initial search using the original query
are presented to the user, who judges which docu-
ments are relevant to their information needs. To au-
tomate this procedure, the top k results of an initial
search may be regarded as relevant; this is known as
pseudo-relevance feedback.
However, these initial search results may contain
irrelevant documents or miss relevant documents ow-
ing to the ambiguity of words in the corpus and/or
query (Krovetz and Croft, 1992). Therefore, imbal-
ances in the corpus may be reflected in the initial re-
trieval if an ambiguous term is overrepresented, lead-
ing to query drift (Croft et al., 2001). Clarity has been
suggested as a measure by which users can be guided
in disambiguating queries (Croft et al., 2001; Cronen-
Townsend and Croft, 2002).
Personalised QE
The prevalence of user-generated data on the Web has
been influential in the development of personalised
query expansion. By considering users’ actions out-
side the search context, these approaches attempt to
produce results that are closely tied to users’ interests.
A wide variety of sources from which expansion
terms can be drawn has been explored, such as
emails (Chirita et al., 2007), social tagging (Bouad-
jenek et al., 2019), and the fusion of social annota-
tions with knowledge bases (Zhou et al., 2017).
For instance, Ould Amer et al. (Ould-Amer et al.,
2016) investigated personalised book search by con-
structing word2vec embeddings of users’ item cata-
logues. While they found that the personalised set-
ting outperformed the non-personalised setting only
when a small number of expansion terms were in-
cluded, they attributed this finding to the quality of the corpus rather than to the method in general.
Recently, Bassani et al. (Bassani et al., 2023) used
contextual word embeddings for personalised QE by
aligning topical clusters of user and document embed-
dings and selecting the most appropriate terms using
a nearest-neighbour search.
Language Modelling
Following Efthimiadis’ classification, expansion
terms can also be sourced from knowledge structures
that are not directly tied to the search process, such as
domain-specific or general thesauri. Language mod-
els, such as word embeddings, can also be considered
a form of knowledge structure.
Popular approaches for training word embeddings
include word2vec (Mikolov et al., 2013a; Mikolov
et al., 2013b) and GloVe (Pennington et al., 2014).
These vectors are derived such that words with simi-
lar meanings correspond to vectors that are also close
in the embedding space; cosine distance is a common
choice as a metric. This property also makes em-
beddings an interesting resource for candidate terms
for query expansion because words that are similar to
the original query can be chosen for expansion. Roy
et al. (Roy et al., 2016) studied different methods to
extract candidate terms from word2vec embeddings;
among others, they explored the compositionality of
word vectors to extend a query with terms similar to
a bigram decomposition of the initial query.
Large language models, such as BERT (Devlin
et al., 2019), use contextualised embeddings, where
individual words are represented by multiple vectors
depending on their context. Contextualised word em-
beddings have been shown to perform well in cap-
turing senses and performing word sense disambigua-
tion (Wiedemann et al., 2019; Loureiro et al., 2021).
Specifically, SensEmBERT (Scarlini et al., 2020) fuses
BERT with semantic networks to produce multilin-
gual sense embeddings.
Naseri et al. (Naseri et al., 2021) used contextu-
alised embeddings provided by BERT for query ex-
pansion, reporting improvements in average precision
on TREC test collections compared to static mod-
els. Wang et al. (Wang et al., 2023) and Jagerman
et al. (Jagerman et al., 2023) used LLMs to gener-
ate pseudo-documents and expansion terms by utiliz-
ing the inference capabilities of these models. The
model output was subsequently appended to the orig-
inal query, with improvements in retrieval perfor-
mance across several test collections.

Figure 1: Components of the expansion system. Users’ past interactions are recorded and represented via word embeddings. Any query issued by the user passes through the expansion component, where the top-k similar terms are appended.
While large models have been shown to perform
well in word sense disambiguation, this remains an
open question for static models. However, these
smaller models are efficient even in constrained set-
tings and constitute a “small world” model of a user’s
language, thus aiding explainability for downstream
tasks. To the best of our knowledge, this is the
first study to investigate the “disambiguation power”
of personalised query expansion using local embed-
dings.
3 METHODOLOGY
The query expansion system we consider comprises
three key components: In an offline procedure, word
embeddings are trained on individual user profiles to
capture semantic relationships between terms within
the user’s context. The expansion component uses
these embeddings to extend the original query with
semantically similar expansion terms. Finally, the
search engine processes the query and retrieves and
ranks the documents. The components and their in-
teractions are shown in Figure 1.
This architecture enables personalised semantic
query expansion by leveraging user profile informa-
tion to disambiguate short, ambiguous queries and re-
trieve more relevant results tailored to the user’s inter-
ests and context.
To better understand the extent to which PSemQE
can disambiguate queries, we addressed the following
research questions:
RQ1. For short, ambiguous queries, can PSemQE re-
trieve documents that the base query cannot?
RQ2. Are relevant documents ranked higher under
PSemQE than the base query?
RQ3. Does PSemQE yield more relevant documents
than the base query?
In the following section, we operationalise the key
concepts for our work and then document our data
collection process. In particular, we describe how we
built synthetic user profiles in the absence of ground
truth.
3.1 Operationalization
User Context
We broadly define user context as any data that char-
acterises the interests of a user. For this study, we
used browsing histories containing items that a user
had viewed previously. We assume that these histo-
ries implicitly describe a user’s interests and under-
standing of the domain.
Query Disambiguation
We consider a query ambiguous if it consists of one
or more words that exhibit multiple senses. Query
disambiguation is the process of inferring the sense
intended by the user and having the search results re-
flect this sense.
We chose to model the problem setting on the En-
glish Wikipedia owing to the availability of ambigu-
ous lemmata along with possible disambiguations.
The titles of dedicated disambiguation pages refer
to ambiguous lemmata, and the body contains a list
of links to pages that disambiguate the term. For
the evaluation data, we selected the title of a disam-
biguation page as the query and selectively sampled
two pages from the list as the disambiguation targets.
Given a user profile that consists of articles relating
to the sense of one of the targets, we regard disam-
biguation as successful if the target is present in the
search results. An illustrative example: for the query
“Amazon” and a user showing interest in geography, ecosystems, and climate, the query is successfully disambiguated if the results contain the target article “Amazon rainforest”.
Synthetic User Profiles
While the raw Wikipedia Web request
stream (Wikipedia contributors, 2025a) contains
the requesting IP address and requested URL, this
data is considered sensitive personal information
according to the Wikimedia Foundation’s privacy
policy (Wikipedia contributors, 2025d) and is not
available to the general public. Anonymised ag-
gregated data are available in the form of pageview
dumps (Wikipedia contributors, 2025b) or the
Wikipedia Clickstream dataset (Wikipedia con-
tributors, 2025c), but these can only provide an
approximation of real users’ browsing habits. While
Arora et al. (Arora et al., 2022) argued that in certain
cases, “synthetic data is enough”, they also noted that
real user traces are useful for tracking the patterns
of individual users. We reproduced their approach
in preliminary experiments but found the resulting profiles to be too broad in their domains of interest, such that no meaningful keywords could be extracted from them. These preliminary findings suggest that to
perform meaningful personalised query expansion,
user profiles should revolve around well-defined
topics.
Owing to the lack of publicly available user pro-
files for Wikipedia, we generated synthetic profiles
based on a simple knowledge graph, the process of
which is described in Section 3.2. The synthetic
profiles we collected consist of an unordered set of
unique page IDs of Wikipedia articles and consti-
tute a model of users with specific interests browsing
Wikipedia.
3.2 Data Collection
As a consequence of the operationalisation given in
the preceding section, we require two datasets for
our evaluation: ambiguous Wikipedia articles, along
with possible disambiguations, and user profiles. This
section describes the collection strategies for both
datasets.
Ambiguous Articles
We selectively sampled a set of 14 disambiguation pages from the category listing for all article disambiguation pages (https://en.wikipedia.org/wiki/Category:All_article_disambiguation_pages); we refer to these as the parent term or simply parent in the remainder of this paper. From each disambiguation page, we again selectively sampled two pages as disambiguation targets, from which we extracted the full-text content of the latest revision via the MediaWiki API (https://www.mediawiki.org/wiki/API:Main_page). The raw contents were stripped of wiki-markup using mwparserfromhell (https://github.com/earwig/mwparserfromhell) in preparation for further processing. We collected 28 (parent, target) tuples, two of which are listed in Table 1.
User Profiles
As outlined in Section 3.1, for this study, we regard a
user profile as an unordered set of Wikipedia articles
with the requirement that the articles share a common theme. In this section, we describe the process of generating synthetic profiles using random walks on a knowledge graph.

Table 1: Two example parent terms and the corresponding target articles to gauge disambiguation; each target is a concrete article linked to from the parent’s disambiguation page.

| Parent ID | Parent title | Target ID | Target title |
| 29621629 | Amazon | 1701 | Amazon River |
| 29621629 | Amazon | 90451 | Amazon (company) |
| 8239 | Dylan | 4637590 | Bob Dylan |
| 8239 | Dylan | 8741 | Dylan (programming language) |

Figure 2: Construction of the knowledge graph: starting from target entities e_t, mention-relations are extracted, yielding entities e_m, from which category-relations yield entities e_c.
Knowledge graphs (KGs) model facts regarding a
domain using a graph structure. In KGs, nodes are
typically referred to as entities E, and the edges be-
tween them as relations R; formally, KG = (E, R).
The entities in our knowledge graph KG_wiki consist of Wikipedia articles, which we connect through two types of undirected relations: r_m, a mentions-relation indicating that one article is mentioned in the body of the other, and r_c, a category-relation indicating that the endpoints have a common Wikipedia category. Figure 2 illustrates the iterative procedure used to generate the knowledge graph.
Initially, KG_wiki contained the 28 target articles as entities (cf. Section 3.2); we refer to these as e_t. In the first step, we identified articles mentioned in the body of the target articles, added them to the graph as new entities e_m, and connected each to the target in which it was mentioned through a new relation r_m. In the second step, we determined which Wikipedia categories each e_m belonged to and collected further articles from these categories. These were again added to the graph as entities e_c, connected to e_m through a category-relation.
To identify mentions-relations, we used DBpedia
Spotlight (Daiber et al., 2013), a service for annotating text with Wikipedia articles that are indexed by DBpedia. Spotlight identifies surface terms in the text that refer to DBpedia entities in a context-aware way. For instance, the surface form “boroughs” in the Wikipedia article on Berlin is linked to the concept “Boroughs and neighborhoods of Berlin”. We queried
the Spotlight API using the default settings, notably
a confidence setting of 0.5. For each identified DB-
pedia entity, we fetched the Wikipedia page ID and
retrieved the full text of the latest revision.
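As an illustration, a minimal call against the public Spotlight REST endpoint might look as follows; the endpoint and response fields follow the public API, while error handling is reduced to the essentials.

    import requests

    def annotate(text: str, confidence: float = 0.5):
        """Ask DBpedia Spotlight for entity mentions in `text`."""
        resp = requests.get(
            "https://api.dbpedia-spotlight.org/en/annotate",
            params={"text": text, "confidence": confidence},
            headers={"Accept": "application/json"},
            timeout=30,
        )
        resp.raise_for_status()
        # Each resource carries the matched surface form and the DBpedia URI.
        return [(r["@surfaceForm"], r["@URI"])
                for r in resp.json().get("Resources", [])]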
For each article obtained in this way, we extracted
the category labels from the wiki-markup and queried
the Wikipedia API for up to ten members of each cate-
gory. As before, we retrieved the full text of the latest
revision of each article.
This process iteratively builds a knowledge graph
of articles, starting with the target articles. By includ-
ing articles mentioned in the text as well as those re-
lated through mutual categories, we intend to cover
strongly related concepts alongside broader associa-
tions via categories.
The final user profiles are then generated as random walks on KG_wiki. We generated one profile per target article, and each walk originated from the node corresponding to the target article, terminating after a maximum of 40 steps. Because KG_wiki is unweighted, each transition occurs with probability 1/d, where d denotes the degree of the current node. For
each node visited during the walk, we recorded the
Wikipedia page ID of the corresponding article. The
ID of the target article was dropped from the sequence
and recorded as an identifier to retrieve each profile in
the experiment. The profile vectors obtained in this
way consist of Wikipedia articles that share a com-
mon theme: one article may mention another, or a re-
lated concept that is subsequently researched, or the
articles may share a common category. These profiles
approximate users browsing a knowledge base, whose
journeys eventually lead to the target article.
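A condensed sketch of this sampling step, assuming KG_wiki is held in a networkx graph keyed by Wikipedia page IDs (graph and variable names are illustrative):

    import random
    import networkx as nx

    def sample_profile(kg: nx.Graph, target_id: int, max_steps: int = 40) -> list:
        """Uniform random walk starting at the target; each hop has probability 1/d."""
        walk, node = [], target_id
        for _ in range(max_steps):
            neighbours = list(kg.neighbors(node))
            if not neighbours:
                break
            node = random.choice(neighbours)
            walk.append(node)
        # The target's own ID is dropped; it only serves as the profile identifier.
        return [n for n in walk if n != target_id]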
Finally, we trained a word-embedding model for
each profile vector using the full text of each article.
As before, we retrieved the latest revision, removed
markup, and performed minimal pre-processing us-
ing gensim (Řehůřek and Sojka, 2010): tokenisation
on whitespace, lower-case normalisation, removal of
punctuation characters, and pruning of tokens shorter
than two and longer than 15 characters; stop words
were removed using a stop word corpus (Bird et al.,
2009).
The collection of documents comprising a single
profile was then used to train the word2vec model.
We used the default settings provided by gensim, with
the following deviations: the dimension of the embed-
ding vectors was set to 300, using a window size of
three and a minimum word frequency of one; models
were trained over seven epochs.
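With these settings, the per-profile training step reduces to a few lines of gensim; a sketch (the stop-word list comes from NLTK, and article texts are assumed to be available as plain strings):

    from gensim.models import Word2Vec
    from gensim.utils import simple_preprocess
    from nltk.corpus import stopwords  # requires the NLTK 'stopwords' corpus

    STOP = set(stopwords.words("english"))

    def train_profile_embedding(article_texts: list) -> Word2Vec:
        """Train one word2vec model on the documents of a single profile."""
        sentences = [
            [tok for tok in simple_preprocess(text, min_len=2, max_len=15)
             if tok not in STOP]
            for text in article_texts
        ]
        return Word2Vec(sentences, vector_size=300, window=3, min_count=1, epochs=7)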
3.3 Experimental Design
To evaluate whether user profiles can be used to dis-
ambiguate search queries, we devised an experimen-
tal setting that simulates users searching a knowl-
edge base. Predefined ambiguous search queries cor-
responding to the titles of Wikipedia disambiguation
pages were submitted to a search engine. The queries
were provided in two forms: the base case with no fur-
ther keywords appended and the expansion case with
additional keywords appended to the base form. Ex-
pansion keywords were retrieved from the word em-
beddings that constitute user profiles. For each word
in the query, the two closest neighbours in terms of
the cosine distance of their embedding vectors were
determined. Out-of-vocabulary misses were ignored,
and if none of the query words appeared in the embed-
ding, the expanded and plain queries were the same.
Let q_orig denote the original query, w_1, ..., w_n the set of expansion terms, and ∘ the string concatenation operator. The two queries are then given by

    q_base = q_orig
    q_expansion = q_orig ∘ w_1 ∘ ... ∘ w_n
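In code, the expansion step is a nearest-neighbour lookup per query token; a minimal sketch against a trained gensim model, with out-of-vocabulary misses skipped as described above (model and variable names are illustrative):

    from gensim.models import Word2Vec

    def expand_query(query: str, model: Word2Vec, k: int = 2) -> str:
        """Append the k nearest neighbours of each in-vocabulary query word."""
        terms = query.lower().split()
        expansion = []
        for word in terms:
            if word in model.wv:  # ignore out-of-vocabulary misses
                expansion += [w for w, _ in model.wv.most_similar(word, topn=k)]
        return " ".join(terms + expansion)

    # e.g. expand_query("amazon", river_profile_model) -> "amazon river brazil"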
We used this setup for two distinct experiments
that differed in the size and composition of the search
corpora but otherwise followed the same protocol.
Recall from Section 3.2 that we collected 14 am-
biguous parent terms, and two articles relate to each
parent that disambiguate it, leading to a total of 28
(parent, target) tuples. In this collection, each par-
ent appears twice, with one of the two associated ar-
ticles serving as the target. For each tuple, the two
queries were constructed as given above, using the
parent term as q_orig and the profile built around the respective target article as the source of expansion terms. The goal is to search the corpus for the target using q_base and q_expansion. An example best illustrates this approach: for the parent term “Amazon”, the first target instance is “Amazon River”; q_base is set to “amazon”, and based on the profile around this target, q_expansion is determined as “amazon river brazil”. These two queries were submitted to the search engine, and the results were analysed with respect to the target instance. This process is repeated for the second target instance, “Amazon (company)”, and so forth, for each of the remaining tuples.
Table 2: Setup for experiments SMALL and LARGE.

| Experiment | #Documents | #Corpora |
| SMALL | [2, 7] | 14 |
| LARGE | 3576 | 1 |

In the first experiment, referred to as SMALL in the remainder of this paper, the intention was to evaluate disambiguation on corpora containing highly similar documents. These corpora contain a small number
of documents related to one ambiguous query. They
are generated dynamically, once for each parent term,
and always contain both instances of the target arti-
cle, plus up to five other articles sampled from the
same disambiguation page, which serve as “flares”.
Table 2 illustrates the corpus configurations for the
experiments. This leads to 14 distinct corpora used in
SMALL, each consisting of up to seven documents:
    (target_a, target_b, flare_1, ..., flare_5)
For the second experiment (LARGE), we increased the
corpus size and included all articles collected during
the construction of the knowledge graph. The idea is
to mirror a real-world corpus that comprises a multi-
tude of different items. The corpus is compiled once
and reused for all subsequent queries.
The experiments are conducted using lunr.py (https://github.com/yeraydiazdiaz/lunr.py), which uses BM25 internally to weight results. The
search results were retrieved in the order determined
by lunr and were limited to the top 100 results. In to-
tal, we collected 56 result sets (14 parents × 2 targets
× 2 query types).
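The retrieval step itself is thin; a sketch using lunr.py’s indexing interface (document bodies abbreviated, field names are our choice):

    from lunr import lunr

    documents = [
        {"id": "1701", "title": "Amazon River", "body": "..."},
        {"id": "90451", "title": "Amazon (company)", "body": "..."},
    ]
    index = lunr(ref="id", fields=("title", "body"), documents=documents)

    # Results arrive ranked by lunr's internal BM25 scoring; keep the top 100.
    for hit in index.search("amazon river brazil")[:100]:
        print(hit["ref"], hit["score"])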
3.4 Evaluation
We observed that four queries were not expanded, re-
sulting in q_base and q_expansion being identical. In each
of these instances, the absence of expansion terms can
be attributed to an out-of-vocabulary miss in one or
both of the word embedding models utilised during
the expansion phase. As we cannot reason about dif-
ferences in retrieval for identical base and expansion
queries, we omit these cases from the evaluation, re-
ducing the overall number of result sets by 16 (4 parents × 2 targets × 2 query types), from 56 to 40. This left 20 result sets per experiment for evaluation.
After filtering out unexpanded trials, the result sets
for SMALL contain a mean of 5.8 (±1.11) results for
the base case and 5.85 (±1.09) for the expansion case.
The result sets for LARGE contained 89.2 items on
average (±20.3) for the base group, and 99 items
(±4.47) for the expansion group. Figure 3 shows the
distribution of the set size for both experiments.
We measured whether the target article was part of
the result list on a binary level, as well as the numeric
rank of the target in the list. In addition, we measured the overlap of the sets of categories appearing in the results with those appearing in the profile.

Figure 3: Distribution of the result set size across the two experiments, SMALL and LARGE.
4 RESULTS & DISCUSSION
4.1 Results
The following section documents our results, which
are discussed in Section 4.2. We report our findings
on hit rate, rank distribution, and category overlap.
This section concludes with an analysis of observed
outliers.
4.1.1 Hit Rate
Recall that we use target articles as proxies for dis-
ambiguation: they cover two of the multiple possible
senses of the parent term. Therefore, if a search can
retrieve the predetermined target, we assume that the
query has been disambiguated. However, if the target
was not retrieved, the intended meaning was not in-
ferred correctly. HitRate@K is the fraction of queries whose result list contained the target article, over all queries Q:

    HitRate@K = ( Σ_{q ∈ Q} Hit@K(q) ) / |Q|    (1)

Here, Hit@K(q) is a binary indicator determining the presence or absence of the target article t in the set of results R_q for a query q at the cutoff K:

    Hit@K(q) = 1 if t ∈ R_q, and 0 otherwise.    (2)

Note that each query q ∈ Q returns results, as the search corpus always contains at least two articles (the targets) that match the query (cf. Section 3.3). In other words, R_q in Eq. 2 is never empty.
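Computed over our evaluation data, this amounts to a few lines per cutoff; a sketch (result lists and targets keyed by query, names illustrative):

    def hit_rate_at_k(results_by_query: dict, target_by_query: dict, k: int) -> float:
        """Fraction of queries whose top-k results contain the target article."""
        hits = sum(
            1 for q, results in results_by_query.items()
            if target_by_query[q] in results[:k]
        )
        return hits / len(results_by_query)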
Figure 4: HitRate@K over all queries for selected K. Owing to the size of the SMALL corpus, no query returned more than seven results.
Figure 4 shows the HitRate@K for fixed values of
K ∈ {1, 3, 5, 10, 25, 50, 75, 100}. Across both experi-
ments, we found that for lower K and hence shorter
result lists, query expansion outperformed the base
case in terms of HitRate and retrieved the target ar-
ticle in more cases. As the length of the result set
increases, the rate approaches 1.0 for both base and
expansion. In other words, the longer the result list,
the more likely it is to include the target. This is un-
surprising and similar to the observation that perfect
recall can be achieved by simply returning all docu-
ments in the collection (cf. (Manning et al., 2008)).
It is important to consider the two data points in
LARGE at K = 10 and K = 25. In these cases, q_base achieves a higher HitRate@K than q_expansion; that
is, using only the plain query, the target is present
in more result sets. In other words, outliers exist
for which query expansion fails to retrieve the target.
Upon closer examination, we discovered that this is-
sue affected only a single query, and that the observa-
tion stemmed from the specific expansion terms em-
ployed. A more detailed analysis of this observation
is presented in Section 4.1.4.
4.1.2 Distribution of Ranks
Next, we evaluated the rank distribution of the target
articles. Previous work indicates that users tend to
regard the first ten search results highly (Silverstein
et al., 1999; Jansen et al., 2000). Thus, a document’s
rank is an important factor in the subjective assess-
ment of relevance, and the lower the rank, the higher
the chance that the user will direct their interest to the
document.
Figure 5 shows the distribution of ranks for both
experiments across the two groups. The expansion
group tends to yield the target article at lower ranks
than the plain query.
Figure 5: Distribution of ranks; lower rank is better.

Table 3: Summary statistics of rank by query type, pooled over the experiments. A Mann-Whitney U test of base producing larger ranks than exp. was significant at p < 0.001.

| type | Mean (SD) | Median (IQR) | Range |
| base | 3.7 (2.2) | 3.0 (2.0, 5.0) | 1.0, 9.0 |
| exp. | 2.4 (4.5) | 1.0 (1.0, 1.0) | 1.0, 28.0 |

Table 3 shows the characteristics of the distribution of target ranks across the two folds, plain
and expansion. We see that the median rank im-
proves from plain to expansion by two ranks, from
3.0 to 1.0. We conduct a one-tailed Mann-Whitney-
U test (α = 0.001) with a null hypothesis defined as
plain producing ranks lower than or equal to expan-
sion. As indicated in the table, we reject the null
(p = 2.771 × 10^−6) and accept the alternative hypoth-
esis that plain produces greater ranks than expansion.
Thus, we conclude that query expansion tends to pro-
duce lower ranks for target articles than plain queries,
placing them higher on the list of results.
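The test itself is a standard call to scipy; a sketch with illustrative rank data:

    from scipy.stats import mannwhitneyu

    # Target ranks pooled over both experiments (illustrative values).
    ranks_base = [3, 5, 1, 4, 2, 6, 3]
    ranks_exp = [1, 1, 2, 1, 1, 1, 3]

    # One-tailed test: H1 is that base ranks are stochastically greater.
    stat, p = mannwhitneyu(ranks_base, ranks_exp, alternative="greater")
    print(f"U = {stat}, p = {p:.3g}")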
4.1.3 Category Overlap
Our final analysis investigated the overall relevance of
the results, specifically whether PSemQE would re-
turn more relevant results than the base query. We
gauge relevance via Wikipedia categories: for all arti-
cles in the search corpus, we collect the assigned cate-
gory tags. Then, we determine the proportion of these
categories that are also present in the search results. As described in Section 4.1.1, we determined this overlap for different subsets of the results. Let C_user denote the set of categories appearing in a user profile and C^k_search the set of categories contained in the top-k results. Then, we define overlap@k as

    overlap@k = |C_user ∩ C^k_search| / |C^k_search|
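A sketch of this measure, with the per-result category sets assumed to be precomputed:

    def overlap_at_k(profile_categories: set, result_categories: list, k: int) -> float:
        """Share of categories in the top-k results that also occur in the profile."""
        top_k = result_categories[:k]
        c_search = set().union(*top_k) if top_k else set()
        if not c_search:
            return 0.0
        return len(profile_categories & c_search) / len(c_search)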
Figure 6: Overlap between search result categories and user profile.

The mean overlap@k is shown in Figure 6. As shown in the figure, query expansion consistently achieves a higher overlap between the two category sets than the plain variant. This is true not only for the experiment on the SMALL corpus, where the degree of overlap tops out at 11% owing to the limited size of the search corpus, but also for the LARGE corpus, where query expansion starts off with a higher overlap than a plain query even for a single result (k = 1). In this case, as we increased the size of the result set (k → 100), the degree of overlap approached 25%. Thus, query expansion can better reproduce
users’ interests than a plain query, even for compa-
rably small result lists of 50 items or fewer.
4.1.4 Outlier Analysis
In this section, we revisit the outliers observed in the preceding analyses. Evaluating HitRate@K, we
observed that on the LARGE corpus with query expan-
sion, some targets were not retrieved, even for result
lists of length 25 (cf. Figure 4). This observation is
tightly coupled with the outlier in Figure 5, where one
article is ranked at position 28 with query expansion.
The latter observation explains the former: because of
its rank, the article only appears in the result lists with
a length of 28 or greater.
The article in question is “Amazon (company)”,
and the expanded query presented to the search en-
gine was “amazon products linux”. In the respec-
tive embedding, the terms “products” (cos(θ) = 0.99)
and “linux” (cos(θ) = 0.96) were closest to “amazon”
based on their vectors’ cosine similarity. Overall, the
embedding captured the notion of “amazon” in a tech-
nology context, with terms such as developers, com-
puter, app, and startup among the 20 most similar
terms. However, the term linux appears only in one
of the 31 articles used to train the embedding: “Android (operating system)”. Notably, the target article “Amazon (company)” also does not contain the term
linux.
As mentioned in Section 3.3, the documents were
scored using the BM25 model, which is sensitive to
term frequencies. For document D and query Q = q_0, ..., q_n, the document-level score BM25(Q, D) is computed as (cf. (Manning et al., 2008))

    BM25(Q, D) = Σ_{i=0}^{n} IDF(q_i) · ( tf(q_i, D) · (k_1 + 1) ) / ( tf(q_i, D) + k_1 · (1 − b + b · DL/AVDL) )

Here, k_1 and b are tunable parameters (discussed below), tf(q_i, D) is the raw term frequency of query word q_i in document D, IDF(q_i) is the inverse document frequency of q_i in the corpus, and DL and AVDL are the current document’s length and the average document length across the entire corpus, respectively.
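For reference, the per-document score can be written down directly; a sketch with common default parameters (lunr’s internal defaults may differ):

    def bm25_score(query_terms, tf, idf, doc_len, avg_doc_len, k1=1.2, b=0.75):
        """Sum the per-term BM25 contributions for one document (cf. the formula above)."""
        score = 0.0
        for term in query_terms:
            f = tf.get(term, 0)  # an absent term (e.g. linux below) contributes zero
            denom = f + k1 * (1 - b + b * doc_len / avg_doc_len)
            score += idf.get(term, 0.0) * f * (k1 + 1) / denom
        return score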
In Table 4, we compare the raw term counts and
the BM25 score for the outlier with the top three re-
sults in the set. Although the target article achieves
marginally higher BM25 scores for both product and amazon, these differences do not outweigh the absence
of linux: with a term frequency of zero, the summand
for this query word also becomes zero.
To further gauge the impact of k_1, b, and tf(·) on the rank of “Amazon (company)”, we conducted a sensitivity analysis for each parameter. k_1 determines the scaling of term frequencies, with suggested values in [1.2, 2] (Spärck Jones et al., 1998). At the extremes, this parameter turns BM25 into a binary model (k_1 = 0), detecting only term presence, or scales the score linearly in tf(·) (large k_1) as if using raw term frequency (Manning et al., 2008). The parameter b influences the score’s dependence on document length and ranges over [0, 1]; for b = 0, document length is not factored in, while b = 1 fully scales the score by document length.
Figure 7 shows the results of the parameter analysis. We vary k_1 over [0, 20] and b over [0, 1], as well as the raw term frequency, for the target article only, over the range [0, 41], up to the maximum term frequency in the set of these four articles. (Varying the term frequency also influences the term’s IDF; as the difference is only on the order of 1 × 10^−2, we omit further analysis of this effect.) To achieve the highest score for the target article, we would have to set k_1 = 7.5, which is well above the recommended range. Changing the document-length scaling factor b does not affect the target’s rank. However, if the term linux appeared at least twice in the document, it would rank among the top three results; if it appeared at least four times, it would take the top position.
An interesting observation is that although the tar-
get article was ranked far down the list using the
expanded query, the search results nevertheless con-
tained more relevant results in terms of category over-
lap than the base query, especially for ranks 1 through 3 and up to 10, as illustrated in Figure 8.

Figure 7: BM25 score analysis for k_1, b, and term counts of linux. The line corresponding to the target article is highlighted; dashed lines indicate the threshold at which the target-article score exceeds the others’; dotted lines indicate the values of k_1 and b set for the experiments.
In summary, we found that the target article was
not retrieved among the top results because of the
expansion with a keyword that was not relevant to
this document. Altering the tunable parameter k_1 of
the BM25 model to scale the remaining query-term
frequencies resolves this issue, although the conse-
quences for other queries are unclear. A more viable
approach would be to not augment the query with ir-
relevant keywords, although “irrelevant” is subjective
in this setting: we only know that the keyword is ir-
relevant because we know the target beforehand. A
better approach would be to further refine the mod-
els used for the profile embeddings by, for example,
pruning words from the vocabulary that occur below
a given threshold or shrinking the window size of the
word2vec-algorithm to avoid loose associations be-
tween terms.
Table 4: Raw term counts and BM25 scores for the outlier article, compared to the top three results.

| rank | title | product (count) | amazon (count) | linux (count) | product (BM25) | amazon (BM25) | linux (BM25) | Σ |
| 1 | Tablet computer | 15 | 11 | 13 | 0.87 | 2.67 | 4.13 | 7.67 |
| 2 | Android (operating system) | 9 | 10 | 41 | 0.66 | 2.24 | 4.47 | 7.36 |
| 3 | Audible (service) | 8 | 8 | 1 | 0.92 | 3.00 | 2.63 | 6.55 |
| 28 | Amazon (company) | 48 | 167 | 0 | 0.98 | 3.30 | 0.00 | 4.29 |
Figure 8: Category overlap in the search for “Amazon (company)”.
4.2 Discussion
In RQ1, we asked whether query expansion could
be used to disambiguate queries exhibiting multiple
senses and return results that the base query would
not reveal. To answer this question, we measured suc-
cessful disambiguation by retrieving a predefined tar-
get article. We found that both the plain query and the
expanded version returned the target article at least
within the first ten results. Therefore, both approaches
yield the intended sense at some position in the result
list, and PSemQE only outperforms on top-k lists for
small k.
RQ2 then asked whether query expansion could
rank relevant documents higher. Our analysis of ranks
showed that personalised semantic query expansion
has a significant tendency to produce lower ranks, that
is, it places the article higher in the list. Thus, per-
sonalised semantic query expansion is a supportive
measure for inferring a user’s intended sense based
on their prior interactions with the system.
Finally, in RQ3, we asked whether PSemQE
could produce results that are more relevant to a user
than a simple query. We measured relevance via cat-
egories assigned to articles in terms of the proportion
of categories from a user profile that were present in
the search results. We observed that query expan-
sion can result in more of the user’s categories being
present in the result list, especially in the long tail of
50 results or more. This may indicate that as users
browse through more items through pagination, lazy
loading, or other measures, the relevance of the results
does not necessarily degrade.
In summary, we conclude that personalised query
expansion can be a supportive aid in disambiguating
queries to the intended sense and, overall, can re-
turn more potentially relevant documents than a plain
query would.
4.3 Limitations
The approach we explored has two limitations: First,
we rely on synthetic user profiles that we generated to
model a narrowly focused information need; real user
profiles may not be as strictly focused on a single do-
main. However, as long as there is a dominant con-
text that allows the framing of ambiguous terms, the
profile should be suitable for this approach. We be-
lieve this to be true, especially for libraries, archives,
and research databases, which are often consulted by
users with well-defined research interests.
Second, during profile generation, we controlled
the size of the local profile corpus. For users whose
profiles are sparse or, in the case of new users, entirely
empty, no meaningful word embeddings are available.
This sensitivity to cold start is inherent in any person-
alisation facet of information retrieval, such as recom-
mender systems. In these cases, it may be desirable
to focus on interactive query disambiguation, detect-
ing ambiguity via, for example, thesauri or measures
such as clarity (Croft et al., 2001; Cronen-Townsend
and Croft, 2002). Additionally, users with sparse pro-
files may be prompted to provide seed documents that
have been relevant to them in the past, in the form of
short excerpts or abstracts.
5 CONCLUSION
Query expansion is an important component of in-
formation retrieval (IR) systems. It allows users to
bridge the query-document vocabulary gap without
having to invest mental effort in framing their infor-
mation needs in the context of the queried collection.
This study demonstrated the potential of Person-
alised Semantic Query Expansion (PSemQE) as an
effective approach for disambiguating short, ambigu-
ous queries in IR systems. The experimental re-
sults showed that PSemQE could successfully re-
trieve relevant documents that matched the intended
sense of ambiguous terms, often ranking them higher
than a base query without expansion. Further-
more, PSemQE consistently produced result sets with
higher relevance to user interests, as measured by the
category overlap between search results and user pro-
files.
Although PSemQE shows promise in improving
search relevance and user experience, some limita-
tions must be considered. Reliance on synthetic user
profiles focused on narrow domains may not fully rep-
resent real-world user behaviour. Additionally, this
approach may face challenges with new or inactive
users owing to insufficient profile data for meaningful
word embeddings.
Further research is required to explore ways to ad-
dress these limitations, such as investigating methods
for building more diverse synthetic profiles when no
real-world data are available, developing strategies for
cold-start scenarios, and incorporating interactive dis-
ambiguation techniques. Overall, this study shows
that systems for personalised query expansion can act
as disambiguation engines when they integrate user
information and reliably return meaningful responses
for short, ambiguous queries.
DATA AVAILABILITY
The code to reproduce our experiments is published
at https://github.com/baumanno/psqe-2024.
Data is published via Zenodo at https://doi.org/10.5281/zenodo.10729730.
ACKNOWLEDGMENTS
This article is the outcome of research conducted
within the Africa Multiple Cluster of Excellence
at the University of Bayreuth, funded by the
Deutsche Forschungsgemeinschaft (DFG, German
Research Foundation) under Germany’s Excellence
Strategy – EXC 2052/1 – 390713894.
REFERENCES
Arora, A., Gerlach, M., Piccardi, T., García-Durán, A.,
and West, R. (2022). Wikipedia reader navigation:
When synthetic data is enough. In Proceedings of
the Fifteenth ACM International Conference on Web
Search and Data Mining, WSDM ’22, page 16–26,
New York, NY, USA. ACM.
Bassani, E., Tonellotto, N., and Pasi, G. (2023). Personal-
ized Query Expansion with Contextual Word Embed-
dings. ACM Trans. Inf. Syst., 42(2):61:1–61:35.
Bird, S., Loper, E., and Klein, E. (2009). Natural Language
Processing with Python. O’Reilly.
Bouadjenek, M. R., Hacid, H., and Bouzeghoub, M. (2019).
Personalized social query expansion using social an-
notations. In Hameurlain, A., Wagner, R., Morvan, F.,
and Tamine, L., editors, Transactions on Large-Scale
Data- and Knowledge-Centered Systems XL, pages 1–
25, Berlin, Heidelberg. Springer Berlin Heidelberg.
Chirita, P. A., Firan, C. S., and Nejdl, W. (2007). Person-
alized query expansion for the web. In Proceedings
of the 30th Annual International ACM SIGIR Con-
ference on Research and Development in Informa-
tion Retrieval, SIGIR ’07, page 7–14, New York, NY,
USA. Association for Computing Machinery.
Croft, W. B., Cronen-Townsend, S., and Lavrenko, V.
(2001). Relevance feedback and personalization: A
language modeling perspective. In DELOS Workshops
/ Conferences.
Cronen-Townsend, S. and Croft, W. B. (2002). Quantifying
query ambiguity. In Proceedings of the Second In-
ternational Conference on Human Language Technol-
ogy Research, HLT ’02, page 104–109, San Francisco,
CA, USA. Morgan Kaufmann Publishers Inc.
Daiber, J., Jakob, M., Hokamp, C., and Mendes, P. N.
(2013). Improving efficiency and accuracy in mul-
tilingual entity extraction. In Proceedings of the
9th International Conference on Semantic Systems, I-
SEMANTICS ’13, page 121–124, New York, NY,
USA. ACM.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova,
K. (2019). BERT: Pre-training of Deep Bidirec-
tional Transformers for Language Understanding. In
Burstein, J., Doran, C., and Solorio, T., editors, Pro-
ceedings of the 2019 Conference of the North Amer-
ican Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume
1 (Long and Short Papers), pages 4171–4186, Min-
neapolis, Minnesota. Association for Computational
Linguistics.
Efthimiadis, E. N. (1996). Query expansion. In Williams,
M. E., editor, Annual Review of Information Systems
and Technology (ARIST), volume 31, pages 121–187.
Jagerman, R., Zhuang, H., Qin, Z., Wang, X., and Bender-
sky, M. (2023). Query expansion by prompting large
language models. In Gen-IR@SIGIR 2023: The First
Workshop on Generative Information Retrieval.
Jansen, B. J., Spink, A., and Saracevic, T. (2000). Real
life, real users, and real needs: A study and analysis
of user queries on the web. Information Processing &
Management, 36(2):207–227.
Krovetz, R. and Croft, W. B. (1992). Lexical ambigu-
ity and information retrieval. ACM Trans. Inf. Syst.,
10(2):115–141.
Loureiro, D., Rezaee, K., Pilehvar, M. T., and Camacho-
Collados, J. (2021). Analysis and evaluation of lan-
guage models for word sense disambiguation. Computational Linguistics, 47(2).
Manning, C. D., Raghavan, P., and Schütze, H. (2008). In-
troduction to Information Retrieval. Cambridge Uni-
versity Press.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a).
Efficient Estimation of Word Representations in Vec-
tor Space. In Bengio, Y. and LeCun, Y., editors, 1st In-
ternational Conference on Learning Representations,
ICLR 2013, Scottsdale, Arizona, USA.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean,
J. (2013b). Distributed Representations of Words and
Phrases and their Compositionality. In Proceedings of
the 27th International Conference on Neural Informa-
tion Processing Systems - Volume 2, NIPS’13, page
3111–3119, Red Hook, NY, USA. Curran Associates
Inc.
Naseri, S., Dalton, J., Yates, A., and Allan, J. (2021).
CEQE: Contextualized Embeddings for Query Ex-
pansion. In Hiemstra, D., Moens, M.-F., Mothe, J.,
Perego, R., Potthast, M., and Sebastiani, F., editors,
Advances in Information Retrieval, Lecture Notes in
Computer Science, pages 467–482, Cham. Springer
International Publishing.
Ould-Amer, N., Mulhem, P., and Géry, M. (2016). To-
ward Word Embedding for Personalized Information
Retrieval. In Neu-IR: The SIGIR 2016 Workshop on
Neural Information Retrieval.
Pennington, J., Socher, R., and Manning, C. (2014). GloVe:
Global Vectors for Word Representation. In Pro-
ceedings of the 2014 Conference on Empirical Meth-
ods in Natural Language Processing (EMNLP), pages
1532–1543, Doha, Qatar. Association for Computa-
tional Linguistics.
Řehůřek, R. and Sojka, P. (2010). Software Framework
for Topic Modelling with Large Corpora. In Proceed-
ings of the LREC 2010 Workshop on New Challenges
for NLP Frameworks, pages 45–50, Valletta, Malta.
ELRA.
Roy, D., Paul, D., Mitra, M., and Garain, U. (2016). Using
Word Embeddings for Automatic Query Expansion.
In Neu-IR: The SIGIR 2016 Workshop on Neural In-
formation Retrieval.
Scarlini, B., Pasini, T., and Navigli, R. (2020). SensEmBERT:
Context-enhanced sense embeddings for multilingual
word sense disambiguation. AAAI Conference on Ar-
tificial Intelligence.
Silverstein, C., Marais, H., Henzinger, M., and Moricz, M.
(1999). Analysis of a very large web search engine
query log. ACM SIGIR Forum, 33(1):6–12.
Spärck Jones, K., Walker, S., and Robertson, S. (1998). A
probabilistic model of information and retrieval: De-
velopment and status. Technical Report UCAM-CL-
TR-446, University of Cambridge, Computer Labora-
tory.
Wang, L., Yang, N., and Wei, F. (2023). Query2doc: Query
Expansion with Large Language Models. In Bouamor,
H., Pino, J., and Bali, K., editors, Proceedings of the
2023 Conference on Empirical Methods in Natural
Language Processing, pages 9414–9423, Singapore.
Association for Computational Linguistics.
Wiedemann, G., Remus, S., Chawla, A., and Biemann,
C. (2019). Does BERT make any sense? Inter-
pretable word sense disambiguation with contextual-
ized embeddings. In Proceedings of the 15th Con-
ference on Natural Language Processing (KONVENS
2019): Long Papers, pages 161–170, Erlangen, Ger-
many. German Society for Computational Linguistics
& Language Technology.
Wikipedia contributors (2025a). Data Platform/Data Lake/Traffic/Webrequest — Wikitech. [Online; accessed 25th January 2025].
Wikipedia contributors (2025b). Research:Page view — Wikimedia Meta-Wiki. [Online; accessed 25th January 2025].
Wikipedia contributors (2025c). Research:Wikipedia Clickstream — Wikimedia Meta-Wiki. [Online; accessed 25th January 2025].
Wikipedia contributors (2025d). Wikimedia Foundation Privacy Policy — Wikimedia Foundation Governance Wiki. [Online; accessed 25th January 2025].
Zhou, D., Wu, X., Zhao, W., Lawless, S., and Liu, J. (2017).
Query expansion with enriched user profiles for per-
sonalized search utilizing folksonomy data. IEEE
Transactions on Knowledge and Data Engineering,
29(7):1536–1548.